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Abstract. Control for confounders in observational studies was gener¬ 
ally handled through stratification and standardization until the 1960s. 
Standardization typically reweights the stratum-specific rates so that 
exposure categories become comparable. With the development first of 
loglinear models, soon also of nonlinear regression techniques (logistic 
regression, failure time regression) that the emerging computers could 
handle, regression modelling became the preferred approach, just as 
was already the case with multiple regression analysis for continuous 
outcomes. Since the mid 1990s it has become increasingly obvious that 
weighting methods are still often useful, sometimes even necessary. On 
this background we aim at describing the emergence of the modelling 
approach and the refinement of the weighting approach for confounder 
control. 
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1. INTRODUCTION: CONFOUNDING AND 
STANDARDIZATION 

In this paper we survey the development of mod¬ 
ern methods for controlling for confounding in ob¬ 
servational studies, with a primary focus on discrete 
responses in demography, epidemiology and social 
science. The forerunners of these methods are the 
methods of standardization of rates , which go back 
at least to the 18th century [see Keiding (1987) 
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for a review]. These methods tackle the problem of 
comparing rates between populations with different 
age structures by applying age-specific rates to a 
single “target” age structure and, thereafter, com¬ 
paring predicted marginal summaries in this tar¬ 
get population. However, over the 20th century, the 
methodological focus swung toward indices which 
summarize comparisons of conditional (covariate- 
specific) rates. This difference of approach has, at 
its heart, the distinction between, for example, a 
ratio of averages and an average of ratios—a dis¬ 
tinction discussed at some length in the important 
papers by Yule (1934) and Kitagawa (1964), which 
we shall discuss in Section 4. The change of emphasis 
from a marginal to conditional focus led eventually 
to the modern dominance of the regression mod¬ 
elling approach in these fields. Clayton and Hills 
[(1993), page 135] likened the two approaches to 
the two paradigms for dealing with extraneous vari¬ 
ables in experimental science, namely, (a) to make 
a marginal comparison after ensuring, by random¬ 
ization , that the distributions of such variables are 
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Table 1 

Age standardization: some notation 



Study population 

Standard population 

No. of individuals 

A\ ■ ■ ■ Ak 

Si---S k 

Age distribution 

ai---ak,J2 a i = ^ 

S 1 " ' s k, ^2 s i = 1 

Death rates 

Oil ■ ■ ■ Oik 

Ai ■ ■■Afc 

Actual no. of deaths 

I] AiOi 

ESAi 

Crude death rate 

EAai/EA 



equal, and (b) to fix, or control , such influences and 
make comparisons conditional upon these fixed val¬ 
ues. In sections following, we shall chart how, in 
observational studies, statistical approaches swung 
from the former to the latter. Finally, we note that 
some recent methodological developments have re¬ 
quired a movement in the reverse direction. 

We shall start by recalling the basic concepts of 
direct and indirect standardization in the simplest 
case where a study population is to be compared to a 
standard population. Table 1 introduces some nota¬ 
tion, where there are k age groups. In indirect stan¬ 
dardization, we apply the age-specific death rates 
for the standard population to the age distribution 
of the study, yielding the counterfactual number of 
deaths in the study population if the rates had been 
the same as the standard rates. The Standardized 
Mortality Ratio (SMR) is the ratio between the ob¬ 
served number of deaths in the study population to 
this “expected” number: 

SMR = ^A i a i /^A i A i . 

Note that the numerator does not require knowledge 
of the age distribution of deaths in the study group. 
This property has often been useful. 

In direct standardization one calculates what the 
marginal death rate would have been in the study 
population if its age distribution had been the same 
as in the standard population: 

(Direct) standardized rate = E SiOLi 

This is sometimes expressed relative to the marginal 
rate in the standard population—the Comparative 
Mortality Figure (CMF): 

CMF = J] 

Sato and Matsuyama (2003) and Hernan and 
Robins (2006) gave concise and readable accounts of 


the connection of standardization to modern causal 
analysis. Assume that, as in the above simple sit¬ 
uation, outcome is binary (death) and exposure is 
binary—individuals are either exposed (study pop¬ 
ulation) or unexposed (standard population). Each 
individual may be thought of as having a different 
risk for each exposure state, even though only one 
state can be observed in practice. In addition to 
depending on exposure, risks depend on a discrete 
confounder (age group). The causal effect of the ex¬ 
posure can be defined as the ratio of the marginal 
risk in a population of individuals had they been 
exposed to the risk for the same individuals had 
they not been exposed. Conditional exchangeability 
is assumed; for a given value of the confounder (in 
the present case, within each age group), the coun¬ 
terfactual risks for each individual do not depend 
on the actual exposure status. Then the marginal 
death rate in the unexposed (standard) population 
of individuals had they been exposed is estimated 
by the directly standardized rate, so that the causal 
risk ratio for the unexposed population is estimated 
by the CMF. Similarly, the causal risk ratio for the 
exposed population is estimated by the SMR. We 
may estimate the death rate of the exposed popu¬ 
lation had they not been exposed by the indirectly 
standardized death rate , obtained by multiplying the 
crude rate in the standard population by the SMR: 


Indirect standardized rate 


Y s iY Y A i a i 
Y^i Y. A i.X j 


Both direct and indirect approaches are based on 
comparison of marginal risks, although, as pointed 
out by Miettinen (1972b), they focus on different 
“target” populations; indirect standardization may 
be said to have the study population as its target, 
while direct standardization has the standard popu¬ 
lation as its target. Indeed, the CMF is identical to 
the (reciprocal of) the SMR if “study” and “stan¬ 
dard” populations are interchanged. 

In many epidemiological and biostatistical con¬ 
texts it is natural to use the total population 
(exposed + unexposed) as basis for statements about 
causal risk ratios. With = A t + S{ , the total pop¬ 
ulation size in age group i, the causal risk ratio in 
the total population will be 


YYai _ YM N i/ A i)ai 

YNi\ t Ysm/Si) V 


This rearrangement of the formula shows that we 
may interpret standardization with the total pop¬ 
ulation as target as an inverse probability weight¬ 
ing method in which the weighting compensates for 
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nonobservation of the counterfactual exposure state 
for each subject. In the numerator, the contributions 
of the Ai exposed subjects are inversely weighted by 
Ai/Ni, which estimates the probability that a sub¬ 
ject in age group i of the total study was observed 
in the exposed state. Similarly, in the denominator, 
the Si unexposed subjects are inversely weighted by 
the probability that a subject was observed in the 
unexposed state. The method of inverse probability 
weighting is an important tool in marginal struc¬ 
tural models and other methods in modern causal 
analysis. 

Thus, while there are obvious similarities between 
direct and indirect standardization, there are also 
important differences. In particular, when the aim 
is to compare rates in several study populations, re¬ 
versal of the roles of study and standard population 
is no longer possible and Yule (1934) pointed out 
important faults with the indirect approach in this 
context. Such considerations will lead us, eventually, 
to see indirect standardization as dependent on an 
implicit model and, therefore, as a forerunner of the 
modern conditional modelling approach. 

The plan of this paper is to present selected high¬ 
lights from the historical development of confounder 
control with focus on the interplay between marginal 
or conditional choice of target, on the one hand, and 
the role of (parametric or nonparametric) statisti¬ 
cal models on the other. Section 2 recalls the de¬ 
velopment of standardization techniques during the 
19th century. Section 3 deals with early 20th cen¬ 
tury approaches to the problem of causal inference, 
focusing particularly on the contributions of Yule 
and Pearson. Section 4 records highlights from the 
parallel development in the social sciences, focus¬ 
ing on the further development of standardization 
methods in the 20th century—largely in the social 
sciences. Section 5 deals with the important devel¬ 
opments in the 1950s and early 1960s surrounding 
the analysis of the 2x2 x K contingency table, and 
Section 6 briefly summarizes the subsequent rise and 
dominance of regression models. Section 7 points 
out that the values of parameters in (conditional) 
probability models are not always the only focus of 
analysis, that marginal predictions in different tar¬ 
get populations are often important, and that such 
predictions require careful examination of our as¬ 
sumptions. Finally, Section 8 contains a brief con¬ 
cluding summary. 

Here we have used the word “rate” as a synonym 
for “proportion”, reflecting usage at the time. It was 


later recognized that a distinction should properly 
be made (Elandt-Johnson, 1975, Miettinen, 1976a) 
and modern usage reflects this. However, for this his¬ 
torical review it has been more convenient to follow 
the older terminology. 

2. STANDARDIZATION OF MORTALITY 
RATES IN THE 19TH CENTURY 

Neison's Sanatory Comparison of Districts 

It is fair to start the description of direct and 
indirect standardization with the paper by Neison 
(1844), read to the Statistical Society of London on 
15 January 1844, responding to claims made at the 
previous meeting (18 December 1843) of the Society 
by Chadwick (1844) about “representing the dura¬ 
tion of life”. 

Chadwick was concerned with comparing mortal¬ 
ity “amongst different classes of the community, and 
amongst the populations of different districts and 
countries”. He began his article by quoting the 18th 
century practice of using “proportions of death” 
(what we would now call the crude death rate): the 
simple ratio of number of deaths in a year to the 
size of the population that year. Under the Enlight¬ 
enment age assumption of stationary population, it 
is an elementary demographic fact that the crude 
death rate is the inverse of the average life time in 
the population, but as Chadwick pointed out, the 
stationarity assumption was not valid in England at 
the time. Instead, Chadwick proposed the average 
age of death (i.e., among those dying in the year 
studied). Neison responded: 

That the average age of those who die in 
one community cannot be taken as a test 
of the value of life when compared with 
that in another district is evident from the 
fact that no two districts or places are un¬ 
der the same distribution of population as 
to ages. 

To remedy this, Neison proposed to not only cal¬ 
culate the average age at death in each district, but 

also what would have been the average age 
at death if placed under the same popu¬ 
lation as the metropolis. 

This is what we now call direct standardization , 
referring the age-specific mortality rates in the var¬ 
ious districts to the same age distribution. A little 
later Neison remarked that 
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Another method of viewing this question 
would be to apply the same rate of mor¬ 
tality to different populations, 

what we today call indirect standardization. 

Keiding (1987) described the prehistory of indi¬ 
rect standardization in 18th century actuarial con¬ 
texts; although Neison was himself an actuary, we 
have found no evidence that this literature was 
known to Neison, who apparently developed direct 
as well as indirect standardization over Christmas 
1843. Schweber (2001, 2006) [cf. Bellhouse (2008)] 
attempted a historical-sociological discussion of the 
debate between Chadwick and Neison. 

A few years later Neison (1851) published an elab¬ 
orate survey “On the rate of mortality among per¬ 
sons of intemperate habits” in which he wrote in the 
typical style of the time: 

From the rate of sixteen upwards, it will 
be seen that the rate of mortality exceeds 
that of the general population of England 
and Wales. In the 6111.5 years of life to 
which the observations extend, 357 deaths 
have taken place; but if these lives had 
been subject to the same rate of mortal¬ 
ity as the population generally, the num¬ 
ber of deaths would only have been 110, 
showing a difference of 3.25 times. ... If 
there be anything, therefore, in the us¬ 
ages of society calculated to destroy life, 
the most powerful is certainly the use of 
strong drink. 

In other words, an SMR of 3.25. 

Expected numbers of deaths (indirect standard¬ 
ization) were calculated in the English official sta¬ 
tistical literature, particularly by W. Farr, for ex¬ 
ample, Farr (1859), who chose the standard mor¬ 
tality rates as the annual age-specific death rates 
for 1849-1853 in the “healthy districts”, defined as 
those with average crude mortality rates of at most 
17/1000 [see Keiding (1987) for an example]. W. 
Ogle initiated routine use of (direct) standardization 
in the Registrar-General’s report of 1883, using the 
1881 population census of England and Wales as the 
standard. In 1883, direct standardization of official 
mortality statistics was also started in Hamburg by 
G. Koch. Elaborate discussions on the best choice 
of an international standard age distribution took 
place over several biennial sessions of the Interna¬ 
tional Statistical Institute; cf. Korosi (1892-1893), 
Ogle (1892) and von Bortkiewicz (1904). 


Westergaard and Indirect Standardization 

Little methodological refinement of the standard¬ 
ization methods seems to have taken place in the 
19th century. One exception is the work by the Dan¬ 
ish economist and statistician H. Westergaard, who 
already in his first major publication, Westergaard 
(1882) (an extension, in German, of a prize paper 
that he had submitted to the University of Copen¬ 
hagen the year before), carefully described what he 
called die Methode der erwartungsmassig Gestorbe- 
nen (the method of expected deaths), that is, indi¬ 
rect standardization. He was well aware of the dan¬ 
ger that other factors could distort the result from 
a standardization by age alone and illustrated in a 
small introductory example the importance of what 
we would nowadays call confounder control, and how 
the method of expected number of deaths could be 
used in this connection. 

Table 2 shows that when comparing the mortality 
of medical doctors with that of the general popula¬ 
tion, it makes a big difference whether the calcula¬ 
tion of expected number of deaths is performed for 
the country as a whole or specifically (we would say 
“conditionally”) for each urbanization stratum. In 
Westergaard’s words, our English translation: 

It is seen from this how difficult it is to 
conduct a scientific statistical calculation. 

The two methods both look correct, and 
still yield very different results. Accord¬ 
ing to one method one would conclude 
that the medical professionals live under 
very unhealthy conditions, according to 
the other, that their health is relatively 
good. 

The difficulty derives from the fact that 
there exist two causes : the medical pro¬ 
fession and the place of residence; both 
causes have to be taken into account, and 
if one neglects one of them, the place of 
residence, and only with the help of the 
general life table considers the influence 
of the other, one will make an erroneous 
conclusion. 

The safest is to continue the stratifica¬ 
tion of the material until no further dis¬ 
ruptive causes exist; if one has no other 
proof, then a safe sign that this has been 
achieved, is that further stratification of 
the material does not change the results. 


STANDARDIZATION: A HISTORICAL PERSPECTIVE 


5 


Table 2 

Distribution of deaths of Danish medical doctors 1815-1870, as well as the expected number of deaths if the doctors had been 
subjected to the mortality of the general (male) population, based on age-specific mortality rates for Denmark as a whole as 
well as on age-specific mortality rates separately for each of the three districts Copenhagen, Provincial Towns, Rural 

Districts [Westergaard (1882), page 40] 


Expected number of deaths according to 

Years at risk Dead three special districts whole country 


Copenhagen 

7127 

108 

156 

98 

Provincial towns 

9556.5 

159 

183 

143 

Rural districts 

4213.5 

74 

53 

60 

Whole country 

20,897.0 

341 

392 

301 


This general strategy of stratifying until the theo¬ 
retical variance had been achieved, eliminating any 
residual heterogeneity beyond the basic binomial 
variation, was heavily influenced by the then current 
attempts by Quetelet and Lexis in identifying homo¬ 
geneous subgroups in data from social statistics, for 
which the normal distribution could be used, prefer¬ 
ably with the interpretation of an approximation to 
the binomial [see Stigler (1986) for an exposition on 
Quetelet and Lexis]. In his review of the book, Thiele 
(1881) criticized Westergaard’s account for overin¬ 
terpreting the role of mathematical results such as 
the law of large numbers (as the central limit theo¬ 
rem was then termed) in empirical sciences. As we 
shall see, however, Westergaard remained fascinated 
by the occurrence of binomially distributed data in 
social statistics. 

Westergaard also outlined a derivation of the stan¬ 
dard error of the expected number of deaths, using 
what we would call a Poisson approximation argu¬ 
ment similar to the famous approximation by Yule 
(1934) fifty years later for the standard error of the 
SMR. We shall see later that Kilpatrick (1962) had 
the last word on this matter by justifying Yule’s 
approximation in the framework of maximum likeli¬ 
hood estimation in a proportional hazards model. 

Standard error considerations accompany the 
many concrete calculations on human mortality 
throughout Westergaard’s book from 1882, which 
in our view is original in its efforts to integrate sta¬ 
tistical considerations of uncertainty into mortality 
analysis, with indirect standardization as the central 
tool. In the second edition of the book, Westergaard 
[(1901), page 25] explained that the method of ex¬ 
pected number of deaths has (our translation) 

the advantage of summarizing many small 

series of observations with all their ran¬ 


dom differences without having to aban¬ 
don the classification according to age 
or other groupings (e.g., occupation, resi¬ 
dence etc.), in other words obtaining the 
advantage of an extensive material, with¬ 
out having to fear its disadvantages. 

When Westergaard (1916) finally presented his 
views on statistics in English, the printed comments 
in what we now call JASA were supplemented by a 
detailed review by Edgeworth (1917) for the Royal 
Statistical Society. Westergaard [(1916), page 246] 
had gone as far as to write: 

In vital or economic statistics most num¬ 
bers have a much wider margin of devia¬ 
tion than is experienced in games. Thus 
the death rate, the birth rate, the mar¬ 
riage rate, or the relative frequency of sui¬ 
cide fluctuates within wide limits. But it 
can be proved that, by dividing the obser¬ 
vations, sooner or later a marked tendency 
to the binomial law is revealed in some 
parts of the observations. Thus, the birth 
rate varies greatly from year to year; but 
every year nearly the same ratio between 
boys and girls, and the same proportions 
of stillbirths, and of twins are observed ... 

and (page 248) 

... there is no difficulty in getting sev¬ 
eral important results concerning relative 
numbers. The level of mortality may be 
very different from year to year, but we 
can perceive a tendency to the binomial 
law in the relative numbers, the death 
rates by age, sex, occupation etc. 
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Edgeworth questioned that “Westergaard’s pa¬ 
nacea” would work as a general remedy in all sit¬ 
uations, and continued: 

It never seems to have occurred to him 
that the “physical” as distinguished from 
the “combinatorial” distribution, to use 
Lexis’ distinction, may be treated by the 
law of error [the normal distribution], 

Edgeworth here referred to the empirical (physi¬ 
cal) variance as opposed to the binomial (combina¬ 
torial). Lexis (1876), in the context of time series of 
rates, had defined what we now call the overdisper¬ 
sion ratio between these two. 

Indirect standardization does not require the age 
distribution of the cases Regarding standardization, 
Westergaard [(1916), page 261 ff.] explained and ex¬ 
emplified the method of expected number of deaths, 
as usual without quoting Neison or other earlier 
users of that method, such as Farr, and went on: 

English statisticians often use a modifica¬ 
tion of the method just described of cal¬ 
culating expected deaths; viz., the method 
of “standards” (in fact the method of ex¬ 
pected deaths can quite as well claim the 
name of a “standard” method), 

and after having outlined direct standardization 
concluded, 

In the present case the two forms of com¬ 
parison lead to nearly the same result, and 
this will generally be the case, if the age 
distribution in the special group is not 
much different from that of the general 
population. But on the whole the method 
described last is a little more complicated 
than the calculation of expected deaths, 
and in particular not applicable, if the age 
distribution of the deaths of the barristers 
and solicitors is unknown. 

This last point (that indirect standardization does 
not require the breakdown of cases in the study pop¬ 
ulation by age) has often been emphasized as an im¬ 
portant advantage of indirect standardization. An 
interesting application was the study of the emerg¬ 
ing fall of the birth rate read to the Royal Statis¬ 
tical Society in December 1905 by Newsholme and 
Stevenson (1906) and Yule (1906). [Yule (1920) later 


presented a concise popular version of the main hnd- 
ings to the Cambridge Eugenics Society, still inter¬ 
esting reading.] The problem was that English birth 
statistics did not include the age distribution of the 
mother, and it was therefore recommended to use 
some standard age-specific birth rates (here: those 
of Sweden for 1891) and then indirect standardiza¬ 
tion. 

Westergaard and an Early Randomised Clinical 
Trial 

Westergaard (1918) published a lengthy rebuttal 
(“On the future of statistics”) to Edgeworth’s cri¬ 
tique. Westergaard was here mainly concerned with 
the statistician’s overall ambition of contributing to 
“find the causality”, and with a main point being his 
criticism of “correlation based on Bravais’s formula” 
as not indicating causality. However, he also had an 
interesting, albeit somewhat cryptic, reference to a 
topic that was to become absolutely central in the 
coming years: that simple binomial variation is jus¬ 
tified under random sampling. In his 1916 paper, he 
had advocated (page 238) that 

in many cases it will be practically impos¬ 
sible to do without representative statis¬ 
tics. 

[Edgeworth (1917) taught Westergaard that the 
correct phrase was “sampling”, and Westergaard 
replied that English was for him a foreign language.] 
To illustrate this, Westergaard [(1916), page 245] 
wrote: 

The same formula in a little more com¬ 
plicated form can be applied to the chief 
problem in medical statistics; viz., to find 
whether a particular method of treatment 
of disease is effective. Let the mortality 
of patients suffering from the disease be 
P 2 , when treated with a serum, pi, when 
treated without it, and let the numbers in 
each case be ri 2 and n\. Then the mean 
error of the difference between the fre¬ 
quencies of dying in the two groups will 
be \Jp\q\ln\ + P 2 </ 2/ n 2 and we can get 
an approximation by putting the observed 
relative values instead of p\ and P 2 ■ 

In his rebuttal, Westergaard [(1918), page 508] re¬ 
vealed that this was not just a hypothetical example: 

A very interesting method of sampling 
was tried several years ago in a Danish 
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hospital for epidemic diseases in order to 
test the influence of serum on patients suf¬ 
fering from diphtheria. Patients brought 
into the hospital one day were treated 
with serum, the next day’s patients got 
no injection, and so on alternately. Here 
in all probability the two series of obser¬ 
vations were homogeneous. 

Westergaard here referred to the experiment by 
Fibiger (1898), discussed by Hrobjartsson, Gptzsche 
and Gluud (1998), as “the first randomized clinical 
trial” and further documented in the James Lind Li¬ 
brary: http://www .jameslindlibrary.org/illustrating/ 
records / om-serumbehandling-af-difteri-on- 
treatment-of-diphtheria-with-s/key .passages. 

3. ASSOCIATION, AND CAUSALITY: YULE, 
PEARSON AND FOLLOWING 

The topic of causality in the early statistical liter¬ 
ature is particularly associated with Yule and with 
Pearson, although they were far from the first to 
grapple with the problem. Yule considered the topic 
mainly in the context of discrete data, while Pearson 
considered mainly continuous variables. It is per¬ 
haps this which led to some dispute between them, 
particularly in regard to measures of association. 
For a detailed review of their differences, see Aldrich 
(1995). 

Yule’s Measures of Association and Partial 
Association 

For a 2 x 2 table with entries a, b, c, d, Yule (1900) 
defined the association measure Q = (ad — bc)/(ad + 
be), noting that it equals 0 under independence and 
1 or — 1 under complete association. There are of 
course many choices of association measure that 
fulfil these conditions. Pearson [(1900), pages 14- 
18] immediately made strong objections to Yule’s 


choice; he wanted a parameter that agreed well 
with the correlation if the 2x2 table was gener¬ 
ated from an underlying bivariate normal distribu¬ 
tion. The discussion between Yule and Pearson and 
their camps went on for more than a decade. It was 
chronicled from a historical-sociological viewpoint 
by MacKenzie (MacKenzie, 1978, 1981). 

That he regarded the concrete values of Q mean¬ 
ingful outside of 0 or 1 is illustrated by his anal¬ 
ysis of the association between smallpox vaccina¬ 
tion and attack, as measured by Q, in several towns 
(Table 3). The values of Q were much higher for 
young children than for older people, but did not 
vary markedly between different towns, despite con¬ 
siderable variation in attack rates. This use of Q is 
different from an immediately interpretable popula¬ 
tion summary measure and it is closer to how we use 
models and parameters today. Indeed, since Q is a 
simple transformation of the odds ratio, (ad)/(be), 
Yule’s analyses of association anticipate modern or¬ 
thodoxy (Q = 0.9 corresponds to an odds ratio of 
19, and Q = 0.5 to an odds ratio of 3). 

Yule’s view on causal association was largely ex¬ 
pounded by consideration of its antithesis, which 
he termed “illusory” or “misleading” association. 
Chief amongst the reasons for such noncausal as¬ 
sociation he identified as that due to the direct ef¬ 
fect of a third variable on outcome. His discussion 
of this phenomenon in Yule (1903) (under the head¬ 
ing “On the fallacies that may be caused by the 
mixing of distinct records”) and, later, in his 1911 
book (Yule, 1911) came to be termed “Yule’s para¬ 
dox” , describing the situation in which two variables 
are marginally associated but not associated when 
examined in subgroups in which the third, causal, 
variable is held constant. The idea of measuring the 
strength of association holding further variables con¬ 
stant, which Yule termed “partial” association, was 
thus identified as an important protection against 


Table 3 

Yule’s analysis of the association between smallpox vaccination and attack rates (defined as percentage contracting the 

disease in “invaded household”) 


Town 

Date 

Attack rate under 10 

Attack rate over 10 

Yule 

’s Q 

Vaccinated 

Unvaccinated 

Vaccinated 

Unvaccinated 

<10 

>10 

Sheffield 

1887-1888 

7.9 

67.6 

28.3 

53.6 

0.92 

0.49 

Warrington 

1892-1893 

4.4 

54.5 

29.9 

57.6 

0.93 

0.52 

Dewsbury 

1891-1892 

10.2 

50.8 

27.7 

53.4 

0.80 

0.50 

Leicester 

1892-1893 

2.5 

35.3 

22.2 

47.0 

0.91 

0.51 

Gloucester 

1895-1896 

0° 

oo 

46.3 

32.2 

50.0 

0.80 

0.36 
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fallacious causal explanations. However, he did not 
formally consider modelling these partial associa¬ 
tions. Indeed, he commented (Yule, 1900): 

The number of possible partial coefficients 
becomes very high as soon as we go be¬ 
yond four or five variables. 

Yule did not discuss more parsimonious definitions 
of partial association, although clearly he regarded 
the empirical stability of Q over different subgroups 
of data as a strong point in its favour. Commenting 
on some data on recovery from smallpox, in Yule 
(1912), he later wrote: 

This, as it seems to me, is a most im¬ 
portant property ... If you told any man 
of ordinary intelligence that the associa¬ 
tion between treatment and recovery was 
low at the beginning of the experiment, 
reached a maximum when 50 per cent, of 
the cases were treated and then fell off 
again as the proportion of cases treated 
was further increased, he would, I think, 
be legitimately puzzled, and would require 
a good deal of explanation as to what 
you meant by association. ... The associ¬ 
ation coefficient Q keeps the same value 
throughout, quite unaffected by the ratio 
of cases treated to cases untreated. 

Pearson and Tocher’s Test for Identity of Two 
Mortality Distributions 

Pearson regarded the theory of correlation as of 
fundamental importance, even to the extent of re¬ 
placing “the old idea of causality” (Pearson, 1910). 
Nevertheless, he recognised the existence of “spu¬ 
rious” correlations due to incorrect use of indices 
or, later, due to a third variable such as race 
(Pearson, Lee and Bramley-Moore, 1899). 

Although most of Pearson’s work concerned cor¬ 
relation between continuous variables, perhaps the 
most relevant to our present discussion is his work, 
with J. F. Tocher, on comparing mortality distribu¬ 
tions. Pearson and Tocher (1915) posed the question 
of finding a proper test for comparing two mortal¬ 
ity distributions. Having pointed out the problems 
of comparing crude mortality rates, they considered 
comparison of standardized rates (or, rather, pro¬ 
portions). In their notation, if we denote the number 
of deaths in age group s (= 1,..., S) in the two sam¬ 
ples to be compared by d s , d! s and the corresponding 


numbers of persons at risk by o s ,a', then two age- 
standardized rates can be calculated as 


M =A^ A -i and 




where A s represent the standard population in age 
group s and A = Noting that the differ¬ 

ence between standardized rates can be expressed 
as a weighted mean of the differences between age- 
specific rates, 


M' - M = V 


A s ( d s 
A \a s 



they showed that, under the null hypothesis that 
the true rates are equal for the two groups to be 
compared, 

Var(M'-M)=£(T) P.(l-?>.)(4 + £)’ 


where p s denote the (common) age-specific binomial 
probabilities. Finally, for large studies, they advo¬ 
cated estimation of p s by ( d s + d' s )/(a s + a ' s ) and 
treating (M' — M ) as approximately normally dis¬ 
tributed or, equivalently, 

2 (M' — M) 2 

Var(M' - M) 


as a chi-squared variate on one degree of freedom 
(note that their Q 2 is not directly related to Yule’s 
Q). However, they pointed out a major problem 
with this approach; that different choices of stan¬ 
dard population lead to different answers, and that 
there would usually be objections to any one choice. 
In an attempt to resolve this difficulty, they pro¬ 
posed choosing the weights A s /A to maximise the 
test statistic and showed that the resulting Q 2 is a 
X 2 test on S degrees of freedom. This is because, as 
Fisher (1922) remarked, each age-specific 2 x 2-table 
of districts vs. survival contributes an independent 
degree of freedom to the % 2 test. 

Pearson and Tocher’s derivation of this test antici¬ 
pates the much later, and more general, derivation of 
the score test as a “Lagrange multiplier test”. How¬ 
ever, the maximized test statistic could sometimes 
involve negative weights, A s , which they described 
as “irrational”. This feature of the test makes it sen¬ 
sitive to differences in mortality in different direc¬ 
tions at different ages. They discussed the desirabil¬ 
ity of this feature and noted that it should be pos¬ 
sible to carry out the maximisation subject to the 
weights being positive but “could not see how” to 
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do this (the derivation of a test designed to detect 
differences in the same direction in all age groups 
was not to be proposed until the work of Cochran, 
nearly forty years later—see our discussion of the 
2 x 2 x K below). However, they argued that the 
sensitivity of their test to differences in death rates 
in different directions in different age groups in fact 
represented an improvement over the comparison of 
corrected, or standardized, rates since “that idea is 
essentially imperfect and does not really distinguish 
between differences in the manner of dying”. 

Further Application of the Method of Expected 
Numbers of Deaths 

As described in Section 2, Westergaard (1882) 
from the very beginning emphasised that expected 
numbers of death could be calculated according 
to any stratification, not just age. Encouraged by 
Westergaard’s (1916) survey in English, Woodbury 
(1922) demonstrated this through the example of 
infant mortality as related to mother’s age, parity 
(called here order of birth), earnings of father and 
plural births. For example, the crude death rates 
by order of births form a clear J-shaped pattern 
with nadir at third birth; assuming that only age 
of the mother was a determinant, one can calculate 
the expected rates for each order of birth, and one 
gets still a J, though somewhat attenuated, showing 
that a bit of the effect of birth order is explained by 
mother’s age. Woodbury did not forget to warn: 

Since it is an averaging process the method 
will yield satisfactory results only when an 
average is appropriate. 

Stouffer and Tibbitts (1933) followed up by point¬ 
ing out that in many situations the calculations of 
expected numbers for x 2 tests would coincide with 
the “Westergaard method”. 

4. STANDARDIZATION IN THE 20TH 
CENTURY 

Although, as we have seen, standardisation meth¬ 
ods were widely used in the 19th century, it was 
in the 20th century that a more careful examina¬ 
tion of the properties of these methods was made. 
Particularly important are the authoritative reviews 
by Yule (1934) and, thirty years later, by Kitagawa 
(Kitagawa, 1964, 1966). Both these authors saw the 
primary aim as being the construction of what Yule 
termed “an average ratio of mortalities”, although 
Yule went on to remark: 


in Annual Reports and Statistical Re¬ 
views the process is always carried a stage 
further, viz. to the calculation of a “stan¬ 
dardized death-rate”. This extension is re¬ 
ally superfluous, though it may have its 
conveniences 

(the standardized rate in the study population be¬ 
ing constructed by multiplying the crude rate in the 
standard population by the standardized ratio of 
rates for the study population versus the standard 
population). 

Ratio of Averages or Average of Ratios? 

Both Yule and Kitagawa noted that central to the 
discussion was the consideration of two sorts of in¬ 
dices. The first of these, termed a “ratio of aver¬ 
ages” by Yule, has the form J2 w i x i/^2 w iyii while 
the second, which he termed an “average of ratios”, 
has the form Yl w i( x i/yi)/■ Kitagawa noted 
that economists would describe the former as an 
“aggregative index” and the latter as an “average 
of relatives”. 

Both authors pointed out that, although the two 
types of indexes seem to be doing rather different 
things, it is somewhat puzzling that they are alge¬ 
braically equivalent—we only have to write w* = 
WiUi. It is important to note, however, that the al¬ 
gebraic equivalence does not mean that a given in¬ 
dex is equally interpretable in either sense. Thus, for 
the index to be interpretable as a ratio of averages, 
the weights Wi must reflect some population distri¬ 
bution so that numerator and denominator of the 
index represent marginal expectations in the same 
population. Alternatively, to present the average of 
the age-specific ratios, Xi/yt , as a single measure of 
the age-specific effect would be misleading if they 
were not reasonably homogeneous. Kitagawa con¬ 
cluded: 

the choice between an aggregative index 
and an average of relatives in a mortality 
analysis, for example, should be made on 
the basis of whether the researcher wants 
to compare two schedules of death rates 
in terms of the total number of deaths 
they would yield in a standard population 
or in terms of the relative (proportionate) 
differences between corresponding specific 
rates in the two schedules. Both types of 
index can be useful when correctly applied 
and interpreted. 
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Here Kitagawa very clearly defined the distinction 
between what we, in the Introduction, termed the 
marginal and the conditional targets. Immediately 
after this definition, she hastened to point out that: 

It must be recognized at the outset, how¬ 
ever, that no single summary statistic can 
be a substitute for a detailed comparison 
of the specific rates in two or more sched¬ 
ules of rates. 

On the matter of averaging different ratios, Yule 
(1934) started his paper with the example of com¬ 
paring the death rates for England and Wales for 
1901 and 1931. His Table I contains these for both 
sexes in 5-year age groups and he commented: 

... the rates have fallen at all ages up to 
75 for males and 85 for females. At the 
same time the amount of the fall is very 
different at different ages, apart even from 
the actual rise in old age. The problem is 
simply to obtain some satisfactory form of 
average of all the ratios shown in columns 
4 and 7, an average which will measure 
in summary form the general fall in mor¬ 
tality between the two epochs, just as an 
index-number measures the general fall or 
rise in prices. 

So far, there is no requirement for these ratios to 
be similar. However, when describing indirect stan¬ 
dardisation, Yule [(1934), page 12] pointed out that 

if ... all the ratios of sub-rates are the 
same, no variation of weighting can make 
any difference, 

and warned (page 13), 

and perhaps it may be remarked that 
... if the ratios m ur /m sr are very different 
in different age groups, any comparative 
mortality figure becomes of questionable 
value. 

The issue of constancy of ratios was picked up in 
the printed discussion of the paper [Yule, 1934, page 
76] by Percy Stokes, seconder of vote of thanks: 

Those of us who have taught these meth¬ 
ods to students have been accustomed to 
point out that they lead to identical re¬ 
sults when the local rates bear to the stan¬ 
dard rates the same proportion at every 
age. 


Comparability of Mortality Ratios 

Yule noted that, particularly in official mortality 
statistics, standardisation is applied to many differ¬ 
ent study populations so that, as well as the stan¬ 
dardized ratio of mortality in each study popula¬ 
tion to the standard population being meaningful 
in its own right, the comparison of the indices for 
two study populations should also be meaningful. 
He drew attention to the fact that the ratio of two 
seemingly legitimate indices is not necessarily itself 
a legitimate index. He concluded that either type of 
index could legitimately be used either if the same 
weights Wi are used across study populations (for 
ratios of averages) or if the same w* are used (for 
averages of ratios). 

Denoting a standardized ratio for comparing 
study groups A and B with standard by s R a 
and s R b , respectively, Yule suggested that s R a / s R b 
should be a legitimate index of the ratio of mortali¬ 
ties in population A to that in population B. He also 
suggested that, ideally, a R b = s R a / s R b but noted 
that, whereas the CMF of direct standardisation ful¬ 
fills the former criterion, no method of standardisa¬ 
tion hitherto suggested fulfilled this more stringent 
criterion. Indirect standardisation fulfils neither cri¬ 
terion and Yule judged it to be “hardly a method of 
standardisation at all”. 

Yule’s paper is also famous for its derivation of 
standard errors of comparative mortality figures; for 
the particular case of the SMR, we have 

SMR = Observed/Expected, O/E 

and 

S.E.(SMR) » VO/E. 

As noted earlier, this was already derived by West- 
ergaard (1882), although this was apparently not 
generally known. 

A final matter occupying no less than twelve pages 
of Yule (1934) is the discussion of a context-free av¬ 
erage, termed by Yule his C 3 method or the equiv¬ 
alent average death rate , which is just the simple 
average of all age-specific death rates. This quantity 
could also be explained as the death rate standard¬ 
ized to a population with equal numbers in each 
age group. As we shall see below, it was further 
discussed by Kilpatrick (1962) and rediscovered by 
Day (1976) in an application to cancer epidemiology. 
In modern survival analysis it is called the cumula¬ 
tive hazard and estimated nonparametrically by the 
Nelson-Aalen estimator (Nelson, 1972, Aalen, 1978, 
Andersen et ah, 1993). 
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Elaboration: Rosenberg’s Test Factor 
Standardisation 

During World War II, the United States Army 
established a Research Branch to investigate prob¬ 
lems of morale, soldier preferences and other issues 
to provide information that would allow the mili¬ 
tary to make sensible decisions on practical issues 
involved in army life. To formalize some of the tools 
used in that generally rather practical research, il¬ 
lustrated with concrete examples from that work, 
Kendall and Lazarsfeld (1950) introduced and dis¬ 
cussed the terminology of elaboration: A statistical 
relation has been established between two variables, 
one of which is assumed to be the cause, the other 
to be the effect. The aim is to further understand 
that relation by introducing a third variable (called 
test factor) related to the “cause” as well as the “ef¬ 
fect” . Kendall and Lazarsfeld carefully distinguished 
between antecedent and intervening test variables, 
depending on the temporal order of the “cause” and 
the test variables. If the population is stratified ac¬ 
cording to an antecedent test factor, and the par¬ 
tial relationships between the two original variables 
then vanish, the relation between “cause” and “ef¬ 
fect” has been explained through their relations to 
the test variable, which is then termed spurious. If 
the association between cause and effect disappears 
(is reduced) by controlling on the intervening vari¬ 
able, Kendall and Lazarsfeld talk about complete 
(partial) interpretation of the original two-factor re¬ 
lationship. 

We note that interpretation has gone out of use 
at least in epidemiological applications and in most 
of modern causal inference where the focus is on ob¬ 
taining an undiluted measure of the causal effect of 
the “cause”, not diluting this effect by conditioning 
on variables on the causal pathway from cause to ef¬ 
fect [see Pearl (2001) or Petersen, Sinisi and van der 
Laan (2006)]. Instead, a general area of Mediation 
Analysis has grown up; see MacKinnon (2008), Sec¬ 
tion 1.8, for a useful historical survey. 

Rosenberg (1962) used standardization to obtain 
a single summary measure from all the partial (i.e., 
conditional) associations resulting from the stratifi¬ 
cation in an elaboration. Rosenberg’s famous exam¬ 
ple was a study of the possible association between 
religious affiliation and self-esteem for high school 
students, controlling for (all combinations of) fa¬ 
ther’s education, social class identification and high 
school grades. Thus, this is an example of interpreta¬ 
tion by conditioning on variables that might mediate 


an effect of religious affiliation on sons’ self-esteem. 
The crude association showed higher self-esteem for 
Jews than for Catholics and Protestants; by stan¬ 
dardizing on the joint distribution of the three co¬ 
variates in the total population this difference was 
halved. 

Rosenberg emphasized that in survey research the 
end product of the standardisation exercise is not a 
single rate as in demography, but: 

In survey research, however, we are inter¬ 
ested in total distributions. Thus, if we ex¬ 
amine the association between X and Y 
standardizing on Z, we must emerge with 
a standardised table (of the joint distribu¬ 
tion of X and Y) which contains all the 
cells of the original table. 

Rosenberg indicated shortcuts to avoid repeating 
the same calculations when calculating the entries 
of this table. 

The Peters-Belson Approach 

This technique (Belson (1956), Peters (1941)) was 
developed for comparing an experimental group 
with a control group in an observational study on 
some continuous outcome. The proposal is to regress 
the outcome on covariates only in the control group 
and use the resulting regression equation to predict 
the results for the experimental group under the as¬ 
sumption of no difference between the groups. A 
simple test of no differences concludes the analysis. 
Cochran (1969) showed that under some assump¬ 
tions of (much) larger variance in the experimental 
group than the control group this technique might 
yield stronger inference than standard analysis of 
covariance, and that it will also be robust to cer¬ 
tain types of effect modification. The technique has 
recently been revived by Graubard, Rao and Gast- 
wirth (2005). 

Decomposition of Crude Rate Differences and 
Ratios 

Several authors have suggested a decomposition 
of a contrast between two crude rates into a com¬ 
ponent due to differences between the age-specific 
rates and a component due to differences between 
the age structures of the two populations. 

Kitagawa (1955) proposed an additive decompo¬ 
sition in which the difference in crude rates is ex¬ 
pressed as a sum of (a) the difference between the 
(direct) standardized rates, and (b) a residual due to 
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the difference in age structure. Rather than treating 
one population as the standard population and the 
second as the study population, she treated them 
symmetrically, standardising both to the mean of 
the two populations’ age structures: 

Crude rate (study) — Crude rate (standard) 

— ^ ^ diCXi ^ ^ SjAj 

E / , + . a* + Aj 

K - K) — 2 — + Z^ ai ~ Si > —2—' 

The first term contrasts the standardized rates while 
the second contrasts the age structures. 

However, ratio comparisons are more frequently 
employed when contrasting rates and several au¬ 
thors have considered a multiplicative decomposi¬ 
tion in which the ratio of crude rates is expressed 
as the product of a standardized rate ratio and 
a factor reflecting the effect of the different age 
structures. Such a decomposition, in which the age- 
standardized measure is the SMR, was proposed by 
Miettinen (1972b): 

Crude rate (study) JO a i a i 

Crude rate (standard) JO s i^i 

_JJ djotj ^ y djXi 

JJO'jAj J~~] SjXj 

The first term is the SMR and the second, which 
reflects the effect of the differing age structures, Mi¬ 
ettinen termed the “confounding risk ratio”. 

Kitagawa (1964) had also proposed a multiplica¬ 
tive decomposition which, as in her additive decom¬ 
position, treated the two populations symmetrically. 
Here, the standardized ratio measure was inspired 
by the literature on price indices in economics. If, in 
a “base” year, the price of commodity i is po* and the 
quantity purchased is qot and, in year t the equiva¬ 
lent values are pu and qu, then an overall compar¬ 
ison of prices requires adjustment for differing con¬ 
sumption patterns. Simple relative indices can be 
constructed by fixing consumption at base or at t. 
The former is Laspeyres’s index, J2puqoi/yPomi, 
and the latter is Paasche’s index, ypiqu/ypoiqu- 
These are asymmetric with respect to the two time 
points and this asymmetry is addressed in Fisher’s 
“ideal” index, defined as the geometric mean of 
Laspeyres’s and Paasche’s indices. Kitagawa noted 
that Laspeyres’s and Paasche’s indices are directly 
analogous to the CMF and SMR, respectively, and, 


in her symmetric decomposition, 

_ / jj Sjotj jo Oiioii ^ yp Q ,; ?dj 

JO £*A i y JO SjAj JO dj\j y JO A {Si JO CXj Sj 

the first term is an “ideal” index formed by the geo¬ 
metric mean of the CMF and SMR, and the second 
term is: 

the geometric mean of two indexes sum¬ 
marizing differences in /-composition; one 
an aggregative index using the /-specific 
rates of the base population as weights, 
and the second an aggregative index using 
the /-specific rates of the given population 
as weights. 

The paper by Kitagawa (1955) concluded with a 
detailed comparison to the “Westergaard method” 
as documented by Woodbury (1922). Woodbury’s 
paper had also inspired Kitagawa’s contemporary 
R. H. Turner, also a Ph.D. from the University of 
Chicago, to develop an approach to additive de¬ 
composition according to several covariates (Turner, 
1949), showing how the “nonwhite-white” differ¬ 
ential in labour force participation is associated 
with marital status, household relationship and age. 
Kitagawa’s decomposition paper continues to be fre¬ 
quently cited and the technique is still included 
in current textbooks in demography [e.g., Preston, 
Heuveline and Guillot (2001)]. There has been a 
considerable further development of additive decom¬ 
position ideas; for recent reviews see Chevan and 
Sutherland (2009) for the development in demogra¬ 
phy and Powers and Yun (2009) for decomposition 
of hazard rate models and some references to de¬ 
velopments in econometrics and to some extent in 
sociology. We return in Section 6 to the connection 
with the method of “purging” suggested by C. C. 
Clogg. 

5. ODDS RATIOS AND THE 2x2 x K 
CONTINGENCY TABLE 

Case-Control Studies and the Odds Ratio 

Although the case-control study has a long his¬ 
tory, its use to provide quantitative measures of 
the strength of association is more recent, gener¬ 
ally being attributed to Cornfield (1951). Table 4 
sets out results from a hypothetical case-control 
study comparing some exposure in cases of a dis¬ 
ease with that in a control group of individuals free 
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Table 4 

Frequencies in a 2x2 contingency table derived from a 
case-control study 



Cases 

Controls 

Exposed 

A 

B 

Not exposed 

C 

D 


of the disease. In this work, he demonstrated that, 
if the disease is rare, that is, prevalence of disease 
in the population, X, is near zero and the propor¬ 
tion of cases and controls exposed are p\ and po, re¬ 
spectively, then the prevalence of disease in exposed 
subjects is, to a close approximation, Xpi/po, and 
X(1 —pi)/(l — po) in subjects not exposed. Thus, 
the ratio of prevalences is approximated by the odds 
ratio 

Pi / Po 
1 - Pi / 1 - Po ’ 

which can be estimated by (AD)/(BC). 

In this work, Cornfield discussed the problem of 
bias due to poor control selection, but did not ex¬ 
plicitly address the problem of confounding by a 
third factor. In later work Cornfield (1956) did con¬ 
sider the case of the 2x2 x K table in which the 
K strata were different case-control studies. How¬ 
ever, his analysis focussed on the consistency of the 
stratum-specific odds ratios; having excluded outly¬ 
ing studies, he, at this stage, ignored Yule’s paradox, 
simply summing over the remaining studies and cal¬ 
culating the odds ratio in the marginal 2x2 table. 

Interaction and “Simpson’s Paradox” 

Bartlett (1935) linked consistency of odds ratios 
in contingency tables with the concept of “interac¬ 
tion”. Specifically, he defined zero second order in¬ 
teraction in the 2x2x2 contingency table of vari¬ 
ables X, Y and Z as occurring when the odds ratios 
between X and Y conditional upon the level of Z 
are stable across levels of Z. (Because of the symme¬ 
try of the odds ratio measure, the roles of the three 
variables are interchangeable.) In an important and 
much cited paper, Simpson (1951) discussed inter¬ 
pretation of no interaction in the 2x2x2 table, 
noting that “there is considerable scope for para¬ 
dox” . 

If one were to read only the abstract of Simpson’s 
paper, one could be forgiven for believing that he 
had simply restated Yule’s paradox in this rather 
special case: 


it is shown by an example that vanishing 
of this second order interaction does not 
necessarily justify the mechanical proce¬ 
dure of forming the three component 2x2 
tables and testing each of these for signif¬ 
icance by standard methods 

(by “component” tables, he meant the marginal ta¬ 
bles). Thus, “Simpson’s paradox” is often identified 
with Yule’s paradox, sometimes being referred to as 
the Yule-Simpson paradox. However, the body of 
Simpson’s paper contains a much more subtle point 
about the nature of confounding. 

Simpson’s example is a table in which X and Y 
are both associated with Z, in which there is no 
second order interaction, and the conditional odds 
ratios for X versus Y are 1.2 while the marginal 
odds ratio is 1.0. He pointed out that if X is a med¬ 
ical treatment, Y an outcome and Z sex, then there 
is clearly a treatment effect—the conditional odds 
ratio provides the “right” answer, the treatment ef¬ 
fect having been destroyed in the margin by negative 
confounding by sex. Simpson compared this with an 
imaginary experiment concerning a pack of playing 
cards which have been played with by a baby in 
such a way that red cards and court cards, being 
more attractive, have become dirtier. Variables X 
and Y now denote red/black and court/plain and 
Z denotes the cleanliness of the cards. In this case, 
Simpson pointed out that the marginal table of X 
versus Y, “provides what we would call the sensible 
answer, that there is no such association”. This is, 
perhaps, the real Simpson’s paradox—the same ta¬ 
ble demonstrates Yule’s paradox when labelled one 
way but does not when it is labelled another way. 
Simpson’s paper pointed out that the causal status 
of variables is central; one can condition on causes 
when forming conditional estimates of treatment ef¬ 
fects, but not upon effects. As we shall see in the 
next section, this point is central to the problem 
of time-dependent confounding which has inspired 
much recent methodological advance. A closely re¬ 
lated issue is the phenomenon of selection bias, fa¬ 
mously discussed by Berkson (1946) in relation to 
hospital-based studies. There X and Y are observed 
only when an effect, Z (e.g., attending hospital), 
takes on a specific value. 

A further contribution of Simpson’s paper was to 
point out the “noncollapsibility” of the odds ratio 
measure in this zero interaction case; the conditional 
and marginal odds ratios between X and Y are only 
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the same if either X is conditionally independent of 
Z given Y, or Y is conditionally independent of Z 
given X. Note that these conditions may not be sat¬ 
isfied even in randomised studies—another of the 
paradoxes to which Simpson drew attention. For 
a more detailed discussion of Simpson’s paper see 
Hernan, Clayton and Keiding (2011). 

Cochran’s Analyses of the 2x2 X K Table 

In his important paper on “methods for strength¬ 
ening the common x 2 test”, Cochran (1954) pro¬ 
posed a “combined test of significance of the dif¬ 
ference in occurrence rates in the two samples” 
when “the whole procedure is repeated a number of 
times under somewhat differing environmental con¬ 
ditions”. He pointed out that carrying out the y 2 
test in the marginal table 

is legitimate only if the probability p of an 
occurrence (on the null hypothesis) can be 
assumed to be the same in all the individ¬ 
ual 2x2 tables. 

(He did not further qualify this statement in the 
light of Simpson’s insight discussed above.) He pro¬ 
posed three alternative analyses. The first of these 
was to add up the x 2 test statistics from each table 
and to compare the result with the x 2 distribution 
on K degrees of freedom. This, as already noted, is 
equivalent to Pearson and Tocher’s earlier proposal, 
but Cochran judged it a poor method since 

It takes no account of the signs of the dif¬ 
ferences (pi — po) in the two samples, and 
consequently lacks power in detecting a 
difference that shows up consistently in 
the same direction in all or most of the 
individual tables. 

The second alternative he considered was to calcu¬ 
late the “x” value for each table—the square roots 
of the x 2 statistics, with signs equal to those of the 
corresponding (p\ — po)’s—and to compare the sum 
of these values with the normal distribution with 
mean zero and variance K. He noted, however, that 
this method would not be appropriate if the sample 
sizes (the “N's v ) vary substantially between tables, 
since 

Tables that have very small IV’s cannot 
be expected to be of much use in detect¬ 
ing a difference, yet they receive the same 
weight as tables with large N's. 


He also noted that variation of the probabilities of 
outcome between tables would also adversely affect 
the power of this method: 

Further, if the p’s vary from say 0 to 50%, 
the difference that we are trying to detect, 
if present, is unlikely to be constant at all 
levels of p. A large amount of experience 
suggests that the difference is more likely 
to be constant on the probit or logit scale. 

It is clear, therefore, that Cochran considered the 
ideal analysis to be based on a model of “constant 
effect” across the tables. Indeed, when the data were 
sufficiently extensive, he advocated use of empirical 
logit or probit transformation of the observed pro¬ 
portions followed by model fitting by weighted least 
squares. Such an approach, based on fitting a formal 
model to a table of proportions, had already been 
pioneered by Dyke and Patterson (1952), and will 
be discussed in Section 6. 

In situations in which the data were not suffi¬ 
ciently extensive to allow an approach based on em¬ 
pirical transforms, Cochran proposed an alternative 
test “in the original scale”. This involved calculating 
a weighted mean of the differences d = (p\ — po) over 
tables. In our notation, comparing the prevalence of 
exposure between cases and controls, 

d . = _^± _ B > 

A i + Ci Bj + Di 

_l _ i y 1 

Ai + Ci Bi + Di) 

d=Y2widi/ Ywj. 

In calculating the variance of d, he estimated the 
variance of the di s under a binomial model using 
a plug-in estimate for the expected values of pu,poi 
under the null hypothesis: (Ai + Bi)/Ni. Cochran 
described the resulting test as performing well “un¬ 
der a wide range of variations in the N's and p’s 
from table to table”. 

A point of some interest is Cochran’s choice of 
weights which, as pointed out by Birch (1964), was 
“rather heuristic”. If this procedure had truly been, 
as Cochran described it, an analysis “in the original 
scale”, one would naturally have weighted the differ¬ 
ences inversely by their variance. But this does not 
lead to Cochran’s weights, and he provided no justi¬ 
fication for his alternative choice. A likely possibility 
is that he noted that weighting inversely by preci¬ 
sion leads to two different tests according to whether 
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we choose to compare the proportions exposed be¬ 
tween cases and control or the proportions of cases 
between exposed and unexposed groups. Cochran’s 
choice of weights avoided this embarrassment. 

Mantel and Haenszel 

Seemingly unaware of Cochran’s work, Mantel 
and Haenszel (1959) considered the analysis of the 
2x2 x K contingency table. This paper explicitly 
related the discussion to control for confounding in 
case-control studies. Before discussing this famous 
paper, however, it is interesting that the same au¬ 
thors had suggested an alternative approach a year 
earlier (Haenszel, Shimkin and Mantel, 1958). 

As in Cochran’s analysis, the idea was based on 
post-stratification of cases and controls into strata 
which are as homogeneous as possible. Arguing by 
analogy with the method of indirect standardisa¬ 
tion of rates, they suggested that the influence of 
confounding on the odds ratio could be assessed by 
calculating, for each stratum, s, the “expected” fre¬ 
quencies in the 2x2 table under the assumption of 
no partial association within strata and calculating 
the marginal odds ratio under this assumption. The 
observed marginal odds ratio was then adjusted by 
this factor. Thus, denoting the expected frequencies 
by ai,bi,Ci and <k where a* = (A* + 2?*)(A* + C^/Ni 
etc., their proposed index was 

E ^ E A / E a i E 
EAEA/ E h E c i' 

The use of the stratum-specific expected frequencies 
in this way can be regarded as an early attempt, 
in the case-control setting, to estimate what later 
became known as the “confounding risk ratio” and 
which we described in Section 4. 

In their later paper, Mantel and Haenszel (1959) 
themselves criticized this adjusted index which, they 
stated, “can be seen to have a bias toward unity” 
and does “not yield an appropriate adjusted rela¬ 
tive risk”. (Somewhat unconvincingly, they claimed 
that they had used the index fully realizing its de¬ 
ficiencies “to present results more nearly compara¬ 
ble with those reported by other investigators us¬ 
ing similarly biased estimators”!) These statements 
were not formally justified and beg the question as 
to what, precisely, is the estimand? One can only as¬ 
sume that they were referring to the case in which 
the stratum-specific odds ratios are equal and pro¬ 
vide a single estimand. This is the case in which 


Yule’s Q is stable across subgroups. The alternative 
estimator they proposed: 

E AiDi/Ni 

Y.BrCi/Ni 

is a consistent estimator of the stratum-specific odds 
ratio in this circumstance. They also proposed a test 
for association between exposure and disease within 
strata. The test statistic is the sum, across strata, 
of the differences between observed and “expected” 
frequencies in one cell of each table: 

- <*) = - (V + B.KA + C,) 

and its variance under the null hypothesis is 

(Aj + Bi)(Ci + Di)(Ai + Ci)(Bi + Di ) 

^ Nf{Ni- 1) ' 

Some algebra shows that the Mantel-Haenszel test 
statistic is identical to Cochran’s E w idi- There is 
a slight difference between the two procedures in 
that, in calculating the variance, Mantel and Haen¬ 
szel used a hypergeometric assumption to avoid the 
need to estimate a nuisance parameter in each stra¬ 
tum in the “two binomials” formulation. This results 
in the (A r t — 1) term in the above variance formula 
instead of IVj—a distinction which can become im¬ 
portant when there are a large number of sparsely 
populated strata. 

Whereas considerations of bias and, as later 
shown, optimal properties of their proposed test de¬ 
pend on the assumption of constancy of the odds 
ratio across strata, Mantel and Haenszel were at 
pains to disown such a model. They proposed that 
any standardized, or corrected, summary odds ra¬ 
tio would be some sort of weighted average of the 
stratum-specific odds ratios and identified that one 
might choose weights either by precision or by im¬ 
portance. On the former: 

If one could assume that the increased 
relative risk associated with a factor was 
constant over all subclassifications, the es¬ 
timation problem would reduce to weight¬ 
ing the several subclassification estimates 
according to their relative precisions. The 
complex maximum likelihood iterative 
procedure necessary for obtaining such a 
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weighted estimate would seem to be un¬ 
justified, since the assumption of a con¬ 
stant relative risk can be discarded as usu¬ 
ally untenable. 

They described the weighting scheme used in 
the Mantel-Haenszel estimator as approximately 
weighting by precision. Indeed, it turns out that 
these weights correspond to optimal weighting by 
precision for odds ratios close to 1.0. 

An alternative standardized odds ratio estimate, 
in the spirit of weighting and mirroring direct stan¬ 
dardisation, was proposed by Miettinen (1972a). 
This is 

EWiCi/Ds 

where the weights reflect the population distribution 
of the stratifying variable. This index can be unsta¬ 
ble when strata are sparse, but Greenland (1982) 
pointed out that it has clear advantages over the 
Mantel-Haenszel estimate when the odds ratios dif¬ 
fer between strata. This follows from our earlier dis¬ 
cussion (Section 4) of the distinction between a ratio 
of averages and an average of ratios. Since the nu¬ 
merator and denominator of the Mantel-Haenszel 
estimator do not have an interpretation in terms of 
the population average of a meaningful quantity, the 
index must be interpreted as an average of ratios, 
despite its usual algebraic representation. Thus, de¬ 
spite the protestations of Mantel and Haenszel to 
the contrary, its usefulness depends on approximate 
stability of the stratum-specific odds ratios. Green¬ 
land pointed out that Miettinen’s index has an inter¬ 
pretation as a ratio of marginal expectations of epi- 
demiologically meaningful quantities and, therefore, 
may be useful even when odds ratios are heteroge¬ 
neous. He went on to propose some improvements 
to address its instability. 

As was noted earlier, there was a widespread be¬ 
lief that controlling for confounding in case-control 
studies was largely a matter to be dealt with at 
the design stage, by appropriate “cross-matching” 
of controls to cases. Mantel and Haenszel, however, 
pointed out that such matching nevertheless needed 
to be taken account of in the analysis: 

when matching is made on a large number 
of factors, not even the fiction of a ran¬ 
dom sampling of control individuals can 
be maintained. 


They showed that the test and estimate they had 
proposed were still correct in the setting of closely 
matched studies. Despite this, misconceptions about 
matching persisted for more than a decade. 

6. THE EMERGENCE OF FORMAL MODELS 

Except for linear regression analysis for quantita¬ 
tive data, proper statistical models, in the sense we 
know today, were slow to appear for the purpose of 
what we now call confounder control. 

We begin this section with the early multiplica¬ 
tive intensity age-cohort model for death rates by 
Kermack, McKendrick and McKinlay (1934a, 1934b), 
even though it was strangely isolated as a statistical 
innovation: no one outside of a narrow circle of co¬ 
hort analysts seems to have quoted it before 1976. 
First, we must mention two precursors from the ac¬ 
tuarial environment. 

Actuarial Analyses of Cohort Life Tables 

Two papers were read to audiences of actuaries on 
the same evening: 31 January 1927. Derrick (1927), 
in the Institute of Actuaries in London, studied mor¬ 
tality in England and Wales 1841 1925, omitting the 
war (and pandemic) years 1915-1920. On a clever 
graph of age-specific mortality (on a logarithmic 
scale) against year of birth he generalized the paral¬ 
lelism of these curves to a hypothesis that mortality 
was given by a constant age structure, a decreasing 
multiplicative generation effect and no period effect, 
and even ventured to extrapolate the mortality for 
existing cohorts into the future. 

Davidson and Reid, in the Faculty of Actuaries 
in Edinburgh, first gave an exposition of estimating 
mortality rates in a Bayesian framework (posterior 
mode), including the maximum likelihood estimator 
interpretation of the empirical mortality obtained 
from an uninformative prior (Davidson and Reid, 
1926-1927). They proceeded to discuss how the 
mortality variation force might possibly depend on 
age and calendar year and arrived at a discussion 
on how to predict future mortality, where they re¬ 
marked (page 195) that this would be much easier 
if 

there is in existence a law of mortality 
which, when applied to consecutive human 
life—that is, when applied to trace indi¬ 
viduals born in a particular calendar year 
throughout the rest of their lives—gives 
satisfactory results 
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or, as we would say, if the cohort life table could be 
modelled. Davidson and Reid also explained their 
idea through a well-chosen, though purely theoreti¬ 
cal, graph. 


The Multiplicative Model of Kermack, 
McKendrick and McKinley 


Kermack, McKendrick and McKinley published 
an analysis of death-rates in England and Wales 
since 1845, in Scotland since 1860 and in Sweden 
since 1751 in two companion papers. In the substan¬ 
tive presentation in The Lancet (Kermack, McK¬ 
endrick and McKinlay, 1934a)— republished by In¬ 
ternational Journal of Epidemiology (2001) with 
discussion of the epidemiological cohort analysis 
aspects—they observed and discussed a clear pat¬ 
tern in these rates as a product of a factor only 
depending on age and a factor only depending on 
year of birth. 

The technical elaboration in the Journal of Hy¬ 
giene (Kermack, McKendrick and McKinlay, 1934b) 
started from the partial differential equation de¬ 
scribing age-time dependent population growth with 
i^t,ada denoting the number of persons at time t with 
age between a and a + da, giving the death rate at 
time t and age a 


1 ( dv t> a dv t ,a \ 

vt,a V 9t da ) 




here quoted from McKendrick (1925-1926) [cf. Keid- 
ing (2011) for comments on the history of this equa¬ 
tion], and postulate at once the multiplicative model 
for 


fit,a) = a(t - a)p a - 

The paper is largely concerned with estimation of 
the parameters and of the standard errors of these 
estimates; some attention is also given to the possi¬ 
bility of fitting the age effect j3 a to the Gompertz- 
Makeham distribution. 

This fine statistical paper was quoted very little 
in the following 45 years and thus does not seem to 
have influenced the further developments of statis¬ 
tical models in the area. One cannot avoid specu¬ 
lating what would have happened if this paper had 
appeared in a statistical journal rather than in the 
Journal of Hygiene. 1934 was the year when Yule 
had his major discussion paper on standardisation 
in the Royal Statistical Society. In all fairness, it 
should, on the other hand, be emphasised that Ker¬ 
mack et al. did not connect to the then current dis¬ 
cussions of general issues of standardisation. 


The SMR as Maximum Likelihood Estimator 

Kilpatrick (1962), in a paper based on his Ph.D. 
at Queen’s University at Belfast, specified for the 
first time a mortality index as an estimator of a pa¬ 
rameter in a well-specified statistical model -that 
in which the age-specific relative death rate in each 
age group estimates a constant, and only differs from 
it by random variation. Kilpatrick’s introduction is 
a good example of a statistical view on standard¬ 
ization, in some ways rather reminiscent of Wester- 
gaard: 

The mortality experienced by different 
groups of individuals is best compared, 
using specific death rates of sub-groups 
alike in every respect, apart from the sin¬ 
gle factor by which the total population 
is divided. This situation is rarely, if ever, 
realized and we have to be satisfied with 
mortality comparisons between groups of 
individuals alike with regard to two, three 
or four major factors known to affect the 
risk of death. 

In this paper groups are defined as aggre¬ 
gates of occupations (social classes). It is 
assumed that age is the only factor re¬ 
lated to an individual’s mortality within 
a group. This example may readily be ex¬ 
tended to other factors such as sex, mar¬ 
ital status, residence, etc. Although the 
association of social class and age-specific 
mortality may be evaluated by compar¬ 
isons between social classes, specific death 
rates of a social class are more frequently 
compared with the corresponding rates of 
the total population. It is this type of com¬ 
parison which is considered here. 

Kilpatrick then narrowed the focus to developing 
an index I us which 

should represent the “average” excess or 
deficit mortality in group u compared 
with the standard s, 

and noted that, with 9 X representing the ratio be¬ 
tween the mortality rates in age group x in the study 
group and the total population, 

Recent authors ... have shown that the 
SMR can be misleading if there is much 
variation in 9 X over the age range consid¬ 
ered. It would, therefore, seem desirable 
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to test the significance of this variation in 
9 X before placing confidence on the results 
of the SMR or any other index. ... This 
paper proposes a simple test for hetero¬ 
geneity in 9 X and shows that the SMR 
is equivalent to the maximum likelihood 
estimate of a common 9 when the 9 X do 
not differ significantly. It follows therefore 
that the SMR has a minimum standard 
error. 


Formally, Kilpatrick assumed the observed age- 
specific rates in the study group to follow Poisson 
distributions with rate parameters 9\i. The Aj’s and 
the denominators, Ai, were treated as deterministic 
constants, and the mortality ratio, 9, as a parameter 
to be estimated. 

We note that the view of standardisation as an 
estimation problem in a well-specified statistical 
model was principally different from earlier au¬ 
thors. One could refer to the paper by Kermack, 
McKendrick and McKinlay (1934b) discussed above 
(which specifies a similar model), but they did not 
explicitly see their model as being related to stan¬ 
dardisation; their paper has been quoted rarely and 
it seems that Kilpatrick was unaware of it. 

Once standardisation is formulated as an estima¬ 
tion problem, the obvious question is to find an opti¬ 
mal estimator, and Kilpatrick showed that the stan¬ 
dardized mortality ratio (SMR) 


9 = 


Observed number of deaths in the study population 
Expected number of deaths in the study population 


has minimum variance among all indices, and 
that it is the maximum likelihood estimator in 
the model specified by deterministic standard age- 
specific death rates and a constant age-specific rate 
ratio. 

Kilpatrick noted that while the SMR is, in a sense, 
optimal for comparing one study group to a stan¬ 
dard, the weights change from one study group to 
the next so that it cannot be directly used for com¬ 
paring several groups. As we have seen, this point 
had been made often before, particularly forcefully 
by Yule (1934). Kilpatrick compared the SMR to the 
comparative mortality index (CMF) obtained from 
direct standardization and to Yule’s index (the ra¬ 
tio of “equivalent death rates”, that is, direct stan¬ 
dardization using equally large age groups). He con¬ 
cluded: 


Where appropriate and possible, a test of 
heterogeneity on age-specific mortality ra¬ 
tios should precede the use of an index. 
When there is insufficient information to 
conduct the test of heterogeneity, conclu¬ 
sions based solely on the index value may 
apply to none of the individuals studied. 
Caution is strongly urged in the interpre¬ 
tation of mortality indices. 

Kalton—Statistical View of Standardisation in 
Survey Research 

Kalton (1968) surveyed, from a rather mainstream 
statistical view, standardisation as a technique for 
estimating the contrast between two groups and 
to test the hypothesis that this contrast vanishes. 
Kalton emphasized that 

... if the estimate is to be meaningful, 
there must be virtually no “interaction” 
effect in the variable under study between 
the groups and the control variable (i.e., 
there must be a constant difference in the 
group means of the variable under study 
for all levels of the control variable), but 
this requirement may be somewhat re¬ 
laxed for the significance test. 

This distinction implies that the optimal weights 
are not the same for the estimation problem and the 
testing problem. Without commenting on the causal 
structure of “elaboration” Kalton (1968) also gave 
further insightful technical statistical comments to 
Rosenberg’s example (see above) and the use of opti¬ 
mum weights for testing no effect of religious group. 

Kalton seems to have been unaware of Kilpatrick’s 
paper six years earlier, but took a similar main¬ 
stream statistical view of standardisation: that pre¬ 
sentation of a single summary measure of the within- 
stratum effect of the study variable implies a model 
of no interaction between stratum and study vari¬ 
able. 

Indirect Standardisation without External 
Standard 

Kilpatrick had opened the way to a fully model- 
based analysis of rates in lieu of indirect standardis¬ 
ation, and authoritative surveys based on this ap¬ 
proach were indeed published by Holford (1980), 
Hobcraft, Menken and Preston (1982), Breslow et al. 
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(1983), Borgan (1984) and Hoem (1987). Still, mod¬ 
ified versions of the old technique of indirect stan¬ 
dardization remained part of the tool kit for many 
years. 

An interesting example is the attempt by Man¬ 
tel and Stark (1968) to standardize the incidence of 
mongolism for both birth order and maternal age. 
Standardized for one of these factors, the incidence 
still increased as function of the other, but the au¬ 
thors felt it 

desirable to obtain some simple descrip¬ 
tive statistics by which the reader could 
judge for himself what the data showed. 

... What was required was that we deter¬ 
mine simultaneously a set of birth-order 
category rates which when used as a stan¬ 
dard set gave a set of indirect-adjusted 
maternal-age category rates which in turn, 
when used as a standard set, implied the 
original set of birth-order category rates. 

The authors achieved that through an iterative 
procedure, which always converged to “indirect, un¬ 
confounded” adjusted rates, where the convergent 
solutions varied with the initial set of standard rates, 
although they all preserved the ratios of the various 
birth-order category-adjusted rates and the ratios 
of the various maternal-age category-adjusted rates. 
Osborn (1975) and Breslow and Day (1975) formu¬ 
lated multiplicative models for the rates and used 
the same iterative indirect standardisation algo¬ 
rithm for the parameters. Generalizing Kilpatrick’s 
model to multiple study groups, the age-specific rate 
in age group i and study group j is assumed to be 
9j\i. Treating Aj’s as known, the 6 ^s can be esti¬ 
mated by SMRs; the dj's can then be treated as 
known and the A*’s estimated by SMRs (although 
the indeterminacy identified by Mantel and Stark 
must be resolved, e.g., by normalization of one set 
of parameters). See Holford (1980) for the relation of 
this algorithm to iterative proportional fitting of log- 
linear models in contingency tables. Neither Mantel 
and Stark, Osborn, nor Breslow and Day cited Kil¬ 
patrick or Kermack, McKendrick and McKinlay. 

Logistic Models for Tables of Proportions 

We have seen that Cochran (1954) had suggested 
that analysis of the comparison of two groups with 
respect to a binary response in the presence of a 
confounding factor (an analysis of a 2 x 2 x K con¬ 
tingency table) could be approached by fitting for¬ 
mal models to the 2 x K table of proportions, using 


a transformation such as the logit or probit trans¬ 
formation. But such analyses, given computational 
resources available at that time, were extremely la¬ 
borious. Cochran cited the pioneering work of Dyke 
and Patterson (1952) who developed a method for 
fitting the logit regression model to fitted probabil¬ 
ities of response, tt ijk..., in a table: 

log 7T ‘ jk '" = n + a.i + fa + 7 fe 4- 

1 - TRjfc... 

by maximum likelihood, illustrating this technique 
with an analysis estimating the independent con¬ 
tributions of newspapers, radio, “solid” reading and 
lectures upon knowledge of cancer. Initially they ap¬ 
plied an empirical logit transformation to the ob¬ 
served proportions, Pijk..., and fitted a linear model 
by weighted least squares. They then developed an 
algorithm to refine this solution to the true maxi¬ 
mum likelihood, an algorithm which was later gen¬ 
eralized by Nelder and Wedderburn (1972) to the 
wider class of generalized linear models—the now 
familiar iteratively reweighted least squares (IRLS) 
algorithm. Since, in their example, the initial fit to 
the empirical data provided a good approximation 
to the maximum likelihood solution, only one or 
two steps of the IRLS algorithm were necessary— 
perhaps fortunate since the calculations were per¬ 
formed without recourse to a computer. 

Although, in its title, Dyke and Patterson re¬ 
ferred to their paper as concerning “factorial ar¬ 
rangements”, they explicitly drew attention to its 
uses in dealing with confounding in observational 
studies: 

It is important to realise that with this 
type of data there are likely to be a num¬ 
ber of factors which may influence our es¬ 
timate of the effect of say, solid reading 
but which have not been taken into ac¬ 
count. The point does not arise in the case 
of well conducted experiments but is com¬ 
mon in survey work. 

Log-Linear Models and the National Halothane 
Study 

Systematic theoretical studies of multiple cross¬ 
classifications of discrete data date back at least 
to Yule (1900), quoted above. For three-way tables, 
Bartlett (1935) discussed estimation and hypothe¬ 
sis testing regarding the second-order interaction, 
forcefully followed up by Birch (1963) in his study 
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of maximum likelihood estimation in the three-way 
table. 

However, as will be exemplified below in the con¬ 
text of The National Halothane Study, the real prac¬ 
tical development in the analysis of large contin¬ 
gency tables needed large computers for the nec¬ 
essary calculations. This development largely hap¬ 
pened around 1970 (with many contributions from 
L. A. Goodman in addition to those already men¬ 
tioned), and the dominating method was straightfor¬ 
ward maximum likelihood. Particularly influential 
were the dissertation by Haberman (1974), which 
also included important software, and the authori¬ 
tative monograph by Bishop, Fienberg and Holland 
(1975). 

The National Halothane Study Halothane is an 
anaesthetic which around 1960 was suspected in the 
U.S. for causing increased rates of hepatic necrosis, 
sometimes fatal. A subcommittee under the U.S. 
National Academy of Sciences recommended that 
a large cooperative study be performed, and this 
was started in July 1963. We shall here focus on 
the study of “surgical deaths”, that is, deaths dur¬ 
ing the first 6 weeks after surgery. The study was 
based on retrospective information from 34 partic¬ 
ipating medical centres, who reported all surgical 
deaths during the four years 1959-1962 as well as 
provided information on a random sample of about 
38,000 from the total of about 856,000 operations 
at these centres during the four years. The study 
was designed and analysed in a collaborative effort 
between leading biostatisticians at Stanford Uni¬ 
versity, Harvard University and Princeton Univer¬ 
sity/Bell Labs and the report (Bunker et ah, 1969) 
is unusually rich in explicit discussions about how to 
handle the adjustment problem with the many vari¬ 
ables registered for the patients and the correspond¬ 
ing “thin” cross-classifications. For a very detailed 
and informative review, see Stone (1970). The main 
problem in the statistical analysis was whether the 
different anaesthetics were associated with different 
death rates, after adjusting for a range of possible 
confounders, as we would say today. In a still very 
readable introduction by B. W. Brown et al. it was 
emphasized (page 185) that 

the analysis of rates and counts associated 
with many background variables is a re¬ 
curring and very awkward problem. ... It 
is appropriate to create new methods for 
handling this nearly universal problem at 


just this time. High-speed computers and 
experience with them have now developed 
to such a stage that we can afford to ex¬ 
ecute extensive manipulations repeatedly 
on large bodies of data with many control 
variables, whereas previously such heavy 
arithmetic work was impossible. The pres¬ 
ence of the large sample from the Na¬ 
tional Halothane Study has encouraged 
the investigation and development of flex¬ 
ible methods of adjusting for several back¬ 
ground variables. Although this adjust¬ 
ment problem is not totally solved by the 
work in this Study, substantial advances 
have been made and directions for further 
profitable research are clearly marked. 

The authors here go on to emphasise that the need 
for adjustment is not restricted to “nonrandomised” 
studies. 

Pure or complete randomization does not 
produce either equal or conveniently pro¬ 
portional numbers of patients in each 
class; attempts at deep post-stratification 
are doomed to failure because for several 
variables the number of possible strata 
quickly climbs beyond the thousands. 

... Insofar as we want rates for special 
groups, we need some method of esti¬ 
mation that borrows strength from the 
general pattern of the variables. Such a 
method is likely to be similar, at least 
in spirit, to some of those that were de¬ 
veloped and applied in this Study. At 
some stage in nearly every large-scale, 
randomized field study (a large, random¬ 
ized prospective study of postoperative 
deaths would be no exception), the ques¬ 
tion arises whether the randomization 
has been executed according to plan. In¬ 
evitably, adjustments are required to see 
what the effects of the possible failure of 
the randomization might be. Again, the 
desired adjustments would ordinarily be 
among the sorts that we discuss. 

The National Halothane Study has perhaps be¬ 
come particularly famous among statisticians for 
the early multi-way contingency table analyses 
done by Yvonne M. M. Bishop supervised by F. 
Mosteller. This approach is here termed “smoothed 
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contingency-table analysis”, reflecting the above- 
mentioned recognized need for the analysis to “bor¬ 
row strength from the general pattern”. Bishop did 
her Ph.D. thesis in this area; cf. the journal publica¬ 
tions (Bishop, 1969, 1971) and combined efforts with 
S. E. Fienberg and P. Holland in their very influen¬ 
tial monograph on “Discrete Multivariate Analysis” 
(Bishop, Fienberg and Holland, 1975). But the var¬ 
ious versions of data-analytic (i.e., model-free) gen¬ 
eralizations of standardisation are also of interest, 
at least as showing how broadly these statisticians 
struggled with their task: to adjust discrete data for 
many covariates in the computer age. 

The analysis began with classical standardiza¬ 
tion techniques (L. Moses), which were soon over¬ 
whelmed by the difficulty in adjusting for more 
than one variable at a time. Most of the subsequent 
approaches use a rather special form of stratifica¬ 
tion where the huge, sparse multidimensional con¬ 
tingency table generated by cross-classification of 
covariates other than the primary exposure vari¬ 
ables (the anaesthetic agents) are aggregated to 
yield “strata” with homogeneous death rates, the 
agents subsequently compared by standardizing 
across these strata. Several detailed techniques were 
developed for this purpose by J. W. Tukey and col¬ 
leagues, elaborately documented in the report and 
briefly quoted by Tukey (1979, 1991); however, crit¬ 
icisms were also raised (Stone, 1970; Scott, 1978) 
and the ideas do not seem to have caught on. 

Clogg’s “Purging” of Contingency Tables 

Clifford Clogg was a Ph.D. student of Hauser, 
Goodman and Kitagawa at the University of Chicago, 
writing his dissertation in 1977 on Hauser’s theme 
of using a broader measure of underemployment (as 
opposed to just unemployment) as social indicator, 
in the climate of Goodman’s massive recent efforts 
on loglinear modelling and Kitagawa’s strong tra¬ 
dition in standardisation. We shall briefly present 
Clogg’s attempts at combining the latter two worlds 
in the purging techniques [Clogg (1978), Clogg and 
Eliason (1988) and many other articles]. A useful 
concise summary was provided by Sobel [(1996), 
pages 11-14] in his tribute to Clogg after Clogg’s 
early death, and a recent important discussion and 
generalization was given by Yamaguchi (2011). 

Clogg considered a composition variable C with 
categories i = 1,... ,1, a group variable G with 
categories j = 1 and a dependent variable 

D with categories k = l,...,K. The composition 


variable may itself have been generated by cross¬ 
classification of several composition variables. The 
object is to assess the possible association of D 
with G adjusted for differences in the compositions 
across the groups. Clogg assumed that the three-way 
C x G x D classification has already been modelled 
by a loglinear model, and the purging technique was 
primarily promoted as a tool for increased accessi¬ 
bility of the results of that analysis. Most of the 
time the saturated model is assumed, although in 
our view the purging idea is much easier to assimi¬ 
late when there is no three-factor interaction. 

A brief version of Clogg’s explanation is as follows. 
The I x J x K table is modelled by the saturated 
log-linear model 


_„,C^G,D CG,CD GD CGD 

Tj k hU U A' Rj Rfc ^jk Tj k 


where the disturbing interaction is r, GG ; the compo¬ 
sition-specific rate 


^ij(k) ^ijk j ^ ' T^ijk 

k 


is independent of , but the overall rate of occur¬ 
rence 

r -j(k) = Kijk j ^2 Ttijk = ^-jk/^-j- 
i i,k 


does depend on rk G . 

Now purge iTijk of the cumbersome interaction by 
defining purged proportions proportional to 

^ijk ^ijk/^ij T^ijk ^ ijk! ^ /• 

Actually, 


* * C G D CD GD CGD * / ** 

*ijk = r i T i T j T k T ik T jk T ijk > V = Tl/n... 


that is, the ir*j k specify a model with all the same 
parameters as before except that t gg has been re¬ 
placed by 1. 

Rates calculated from these adjusted proportions 
are now purged of the C xG interaction but all other 
parameters are as before. Clogg noted the fact that 
this procedure is not the same as direct standardi¬ 
sation and defined a variant, marginal CG-purging , 
which is equivalent to direct standardisation. 

Purging was combined with further developments 
of additive decomposition methods by Xie (1989) 
and Liao (1989) and was still mentioned in the text¬ 
book by Powers and Xie (2008), Section 4.6, but 
seems otherwise to have played a modest part in re¬ 
cent decades. A very interesting recent application 
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is by Yamaguchi (2011), who used purging in coun- 
terfactual modelling of the mediation of the salary 
gap between Japanese males and females by factors 
such as differential educational attainment, use of 
part-time jobs and occupational segregation. 

Multiple Regression in Epidemiology 

By the early 1960s epidemiologists, in particular, 
cardiovascular epidemiologists, were wrestling with 
the problem of multiple causes. It was clear that 
methods based solely on cross-classification would 
have limited usefulness. As put by Truett, Cornfield 
and Kannel (1967): 

Thus, if 10 variables are under considera¬ 
tion, and each variable is to be studied at 
only three levels, ... there would be 59,049 
cells in the multiple cross-classification. 

Cornfield (1962) suggested the use of Fisher’s dis¬ 
criminant analysis to deal with such problems. Al¬ 
though initially he considered only two variables, 
he set out the idea more generally. This model as¬ 
sumes that the vector of risk factor values is dis¬ 
tributed, in (incident) cases of a disease and in sub¬ 
jects who remain disease free, as multivariate normal 
variates with different means but equal variance- 
covariance matrices. Reversing the conditioning by 
Bayes theorem shows that the probability of disease 
given risk factors is then given by the multiple lo¬ 
gistic function. The idea was investigated in more 
detail and for more risk factors by Truett, Corn¬ 
field and Kannel (1967) using data from the 12-year 
follow-up of subjects in the Framingham study. A 
clear concern was that the multivariate normal as¬ 
sumption was clearly wrong in the situations they 
were considering, which involved a mixture of con¬ 
tinuous and discrete risk factors. Despite this they 
demonstrated that there was good correspondence 
between observed and expected risks when subjects 
were classified according to deciles of the discrimi¬ 
nant function. 

Truett et al. discussed the interpretation of the 
regression coefficients, at some length, but did not 
remark on the connection with multiplicative mod¬ 
els and odds ratios, although Cornfield had, 15 years 
previously, established the approximate equivalence 
between the odds ratio and a ratio of rates (see Sec¬ 
tion 5). They did note that the model is not additive: 

The relation between logit of risk and risk 
is illustrated in Figure 1 ... a constant in¬ 
crease in the logit of risk does not imply 
a constant increase in risk, 


and preferred to present the coefficients of the mul¬ 
tiple logistic function as multiples of the standard 
deviation of the corresponding variable. They did, 
however, make it clear that these coefficients repre¬ 
sented an estimate of the effect of each risk factor 
after holding all others constant. They singled out 
the effect of weight in this discussion: 

The relative unimportance of weight as a 
risk factor ... when all other risk factors 
are simultaneously considered is notewor¬ 
thy. This is not inconsistent with the pos¬ 
sibility that a reduction in weight would, 
by virtues of its effect on other risk fac¬ 
tors, for example, cholesterol, have impor¬ 
tant effects on the risk of CHD. 

Finally, they noted that the model assumes the ef¬ 
fect of each risk factor to be independent of the levels 
of other risk factors, and noted that first order inter¬ 
actions could be studied by relaxing the assumption 
of equality of the variance-covariance matrices. 

The avoidance of the assumption of multivariate 
normality in the logistic model was achieved by use 
of the method of maximum likelihood. In the epi¬ 
demiological literature, this is usually credited to 
Walker and Duncan (1967) who used a likelihood 
based on conditioning on the values, x, of the risk 
factors, and computing maximum likelihood esti¬ 
mates using the same iteratively reweighted least 
squares algorithm proposed by Dyke and Patterson 
(1952). However, use of maximum likelihood in such 
models had also been anticipated by Cox (1958), 
although he had advocated conditioning both on 
the observed set of risk factors, x, and on the ob¬ 
served values of the disease status indicators, y. This 
is the method, now known as “conditional” logis¬ 
tic regression, which is important in the analysis of 
closely matched case-control studies. Like Truett et 
al., Walker and Duncan gave little attention to inter¬ 
pretation of the regression coefficients, save for ad¬ 
vocating standardization to SD units in an attempt 
to demonstrate the relative importance of different 
factors. The main focus seems to have been in risk 
prediction given multiple risk factors. Cox (1958), 
however, discussed the interpretation of the regres¬ 
sion coefficient of a dichomotous variable as a log 
odds ratio, even applying this to an example, cited 
by Cornfield (1956), concerning smoking and lung 
cancer in a survey of physicians. 

A limitation of logistic regression for the analysis 
of follow-up studies is the necessity to consider, as 
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did Truett, Cornfield and Kannel (1967), a fixed pe¬ 
riod of follow-up. A further rationalization of analyt¬ 
ical methods in epidemiology followed from the re¬ 
alization that such studies generate right-censored, 
and left-truncated, survival data. Mantel (1966) pio¬ 
neered the modern approach to such problems, not¬ 
ing that such data can be treated as if each sub¬ 
ject undergoes a series of Bernoulli trials (of very 
short duration). He suggested, therefore, that the 
comparison of survival between two groups could 
be treated as an analysis of a 2 x 2 x K table in 
which the K “trials” are defined by the time points 
at which deaths occurred in the study (other time 
points being uninformative). In his famous paper, 
Cox (1972), described a regression generalization of 
this idea, in which the instantaneous risk, or “haz¬ 
ard” , is predicted by a log-linear regression model so 
that effects of each risk factor may be expressed as 
hazard ratios. Over subsequent decades this theory 
was further extended to encompass many types of 
event history data. See Andersen et al. (1993) for a 
comprehensive review. 

Confounder Scores and Propensity Scores 

Miettinen (1976b) put forward an alternative pro¬ 
posal for dealing with multiple confounders. It was 
motivated by three shortcomings he identified in the 
multivariate methods then available: 

1 . they (discriminant analysis in particular) relied 
on very dubious assumptions, 

2 . they (logistic regression) were computationally 
demanding by the standards then applying, and 

3. they were poorly understood by substantive 
scientists. 

His proposal was to carry out a preliminary, per¬ 
haps crude, multivariate analysis from which could 
be computed a “confounder score”. This score could 
then be treated as a single confounder and dealt 
with by conventional stratification methods. He sug¬ 
gested two ways of computing the confounder score. 
An outcome function was computed by an initial 
regression (or discriminant function) analysis of the 
disease outcome variable on all of the confounders 
plus the exposure variable of interest, then calculat¬ 
ing the score for a fixed value of exposure so that 
it depended solely on confounders. Alternatively, an 
exposure function could be computed by interchang¬ 
ing the roles of outcome and exposure variables, re¬ 
gressing exposure on confounders plus outcome. 


Rosenbaum and Rubin (1983) later put forward 
a superficially similar proposal to the use of Mietti- 
nen’s exposure function. By analogy with random¬ 
ized experiments, they defined a balancing score as 
a function of potential confounders such that expo¬ 
sure and confounders are conditionally independent 
given the balancing score. Stratification by such a 
score would then simulate a randomized experiment 
within each stratum. They further demonstrated, 
for a binary exposure, that the coarsest possible 
balancing score is the propensity score , the probabil¬ 
ity of exposure conditional upon confounders, which 
can be estimated by logistic regression. Note that, 
unlike Miettinen’s exposure score, the outcome vari¬ 
able is not included in this regression. The impact of 
estimation of the nuisance parameters of the propen¬ 
sity score upon the test of exposure effect was later 
explored by Rosenbaum (1984). Hansen (2008) later 
showed that a balancing score is also provided by the 
“prognostic analogue” to the propensity score which 
is to Miettinen’s outcome function as the propen¬ 
sity score is to his exposure function, that is, the 
exposure variable is omitted when calculating the 
prognostic score. 

Given this later work on balancing scores, it is in¬ 
teresting to note that Miettinen discussed at some 
length why he believed it necessary to include the 
“conditioning variable” (either the exposure of in¬ 
terest or the outcome variable) when computing 
the coefficients of the confounder score, noting that 
the need for this was “puzzling to some epidemiol¬ 
ogists”. His argument comes down to the require¬ 
ment to obtain an (approximately) unbiased esti¬ 
mate of the conditional odds ratio for exposure ver¬ 
sus outcome; omission of the conditioning variable 
means that the confounder score potentially con¬ 
tains a component related to only one of the two 
variables of interest and, owing to noncollapsibility 
of the odds ratio, this leads to a biased estimate 
of the conditional effect. Unfortunately, as demon¬ 
strated by Pike, Anderson and Day (1979), Mietti¬ 
nen’s proposal for correcting this bias comes at the 
cost of inflation of the type 1 error rate for the hy¬ 
pothesis test for an exposure effect. To demonstrate 
this, consider a logistic regression of an outcome, 
y , on an exposure of interest, x, and multiple con¬ 
founders, 2 . Miettinen proposed to first compute a 
confounder score s = 7 T z, where 7 are the coeffi¬ 
cients of 2 in the logistic regression of x on y and z, 
and then to fit the logistic regression of y on x and s. 
While this regression yields an identical coefficient 
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for x as the full logistic regression of y on x and z. 
and has the same maximized likelihood, in the test 
for exposure effect this likelihood is compared with 
the likelihood for the regression of y on s which, in 
general, will be rather less than that for y on z —the 
correct comparison point. Thus, the likelihood ratio 
test in Miettinen’s procedure will be inflated. 

Rosenbaum and Rubin circumvented the estima¬ 
tion problem posed by omission of the conditioning 
variable when calculating balancing scores by esti¬ 
mating a marginal causal effect using direct stan¬ 
dardization with appropriate population weights. 
Equivalently, inverse probability weights based on 
the propensity score can be used. 

Owing to the focus on conditional measures of ef¬ 
fect, the propensity score approach was little used in 
epidemiology during the latter part of the 20th cen¬ 
tury. However, the method has gained considerably 
in popularity over the last decade. For a recent case 
study of treatment effect estimation using propen¬ 
sity score and regression methods, see Kurth et al. 
(2006). They emphasised that, as in classical direct 
standardization, precise identification of the target 
population is important when treatment effects are 
nonuniform. 

Time-Dependent Confounding 

Cox’s life table regression model provided an ex¬ 
ceedingly general approach to modelling the proba¬ 
bility of a failure event conditional upon exposure or 
treatment variables and upon extraneous covariates 
or confounders, the mathematical development ex¬ 
tending quite naturally to allow for variation of such 
variables over time. However, shortly after its publi¬ 
cation, Kalbfleisch and Prentice (1980), pages 124- 
126, pointed out a serious difficulty in dealing with 
“internal” (endogenous) time-dependent covariates. 
Referring to the role of variables such as the gen¬ 
eral condition of patients in therapeutic trials which 
may lie on the causal path between earlier treatment 
and later outcome and, therefore, carry part of the 
causal treatment effect, they wrote: 

A censoring scheme that depends on the 
level of a time dependent covariate z(t) 

(e.g., general condition) is ... not indepen¬ 
dent if z(t) is not included in the model. 

One way to circumvent this is to include 
z(t) in the model, but this may mask 
treatment differences of interest. 


Put another way, to ignore such a variable in the 
analysis is to disregard its confounding effect, but its 
inclusion in the conditional probability model could 
obscure some of the true causal effect of treatment. 

While Kalbfleisch and Prentice had identified a 
fundamental problem with the conditional approach 
to confounder adjustment, they offered no convinc¬ 
ing remedy. This was left to Robins (1986). In this 
and later papers Robins addressed the “healthy 
worker” effect in epidemiology-essentially the same 
problem identified by Kalbfleisch and Prentice. 
Robins proposed two lines of attack which we may 
classify as “marginal” and “conditional” in keep¬ 
ing with a distinction that has come up through¬ 
out our exposition. The original approach was “g- 
computation”, which may be loosely conceived as 
sequential prediction of “what would have hap¬ 
pened” under various specified externally imposed 
“treatments” and thus generalizes (direct) standard¬ 
isation, basically a marginal approach. A further 
development in this direction was inverse prob¬ 
ability weighting of marginal structural models, 
that is, models for the counterfactual outcomes 
(Robins, Hernan and Brumback, 2000). Here, the 
essential idea is to estimate a marginal treatment ef¬ 
fect in a population in which the association between 
treatment (exposure) and the time-dependent co¬ 
variate is removed. Sato and Matsuyama (2003) and 
Vansteelandt and Keiding (2011) gave brief discus¬ 
sions of the relationship between g-computation, in¬ 
verse probability weighting and classical standardis¬ 
ation in the simplest (nonlongitudinal) situation. On 
the other hand, the approach via structural nested 
models [e.g., Robins and Tsiatis (1992), Robins et al. 
(1992)] focusses on the effect of a “blip” of exposure 
at time t conditional on treatments and covariate 
values before t. In the latter models, time-varying 
effect modification may be studied. See the recent 
tutorial surveys by Robins and Hernan (2009) or 
Daniel et al. (2013) for details. 

7. PREDICTION AND TRANSPORTABILITY 

We saw that in the National Halothane Study 
standardisation methods were used analytically, in 
order to control for confounders strictly within the 
frame of the concrete study. The general verdict 
in the emerging computer age regarding this use 
of standardisation was negative, as formulated by 
Fienberg (1975), in a discussion of a careful and de¬ 
tailed survey on observational studies by McKinlay 
(1975): 
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The reader should be aware that stan¬ 
dardization is basically a descriptive tech¬ 
nique that has been made obsolete, for 
most of the purposes to which it has tra¬ 
ditionally been put, by the ready avail¬ 
ability of computer programs for loglinear 
model analysis of multidimensional con¬ 
tingency tables. 

However, the original use of standardization not 
only had this analytical ambition, it also aimed 
at obtaining meaningful generalizations to other 
populations—or other circumstances in the origi¬ 
nal population. Before we sketch the recovery since 
the early 1980s of this aspect of standardization, 
it is useful to record the attitude to generalization 
by influential epidemiologists back then. Miettinen 
[(1985), page 47] in his long-awaited text-book, 
wrote: 

In science the generalization from the ac¬ 
tual study experience is not made to a 
population of which the study experience 
is a sample in a technical sense of proba¬ 
bility sampling ... In science the general¬ 
ization is from the actual study experience 
to the abstract, with no referent in place 
or time, 

and thus did not focus on specific recommenda¬ 
tions as to how to predict precisely what might 
happen under different concrete circumstances. A 
similar attitude was voiced by Rothman [(1986), 
page 95] in the first edition of Modern Epidemiol¬ 
ogy , and essentially repeated in the following edi¬ 
tions of this central reference work [Rothman and 
Greenland (1998), pages 133-134, Rothman, Green¬ 
land and Lash (2008), pages 146-147]. The imme¬ 
diate consequence of this attitude would be that all 
that we need are the parameters in the fitted statis¬ 
tical model and assurance that no bias is present in 
the genesis of the concretely analyzed data. 

However, as we have seen, Clogg (1978) (and later) 
had felt a need for interpreting the log-linear mod¬ 
els in terms of their consequences for summary ta¬ 
bles. Freeman and Holford (1980) wrote a useful 
guide to the new situation for survey analysis where 
the collected data had been analyzed using the 
new statistical models. They concluded that much 
favoured keeping the reporting to the model parame¬ 
ters: these would then be available to other analysts 
for comparative purposes, the model fit was nec¬ 
essary to check for interactions (including possibly 


identifying a model where there is no interaction). 
But, 

in many settings these advantages are 
overshadowed by the dual requirements 
for simplicity of presentation and imme¬ 
diacy of interpretation, 

and Freeman and Holford (1980) therefore gave spe¬ 
cific instructions on how to calculate “summary 
rates” for the total population or other populations. 
The main requirement for validity of such calcula¬ 
tions is that there is no interaction between popula¬ 
tion and composition. 

Interestingly, an influential contribution in 1982 
came from a rather different research environment: 
the well-established agricultural statisticians P. W. 
Lane and J. A. Nelder (Lane and Nelder, 1982). In a 
special issue of Biometrics on the theme “the analy¬ 
sis of covariance”, they wrote a short note containing 
several germs of the later so important potential out¬ 
come view underlying modern causal inference, and 
placed the good old (direct) standardisation tech¬ 
nique right in the middle of it. 

Their view was that the purpose of a statistical 
analysis such as analysis of covariance is not only 
to estimate parameters, but also to make what they 
called predictions'. 

An essential feature is the division into 
effects of interest and effects for which 
adjustment is required. ... For example, 
a typical prediction from a variety trial 
is the yield that would have been ob¬ 
tained from a particular variety if it had 
been grown over the whole experimen¬ 
tal area. When a covariate exists the ad¬ 
justed treatment mean can be thought of 
as the prediction of the yield of that va¬ 
riety grown over the whole experimental 
area with the covariate fixed at its mean 
value. ... The predictions here are not of 
future events but rather of what would 
have happened in the experiment if other 
conditions had prevailed. In fact no vari¬ 
ety would have been grown over the whole 
experimental area nor would the covariate 
have been constant. 

Lane and Nelder proposed to use the term pre¬ 
dictive margin for such means, avoiding the term 
“population treatment mean” suggested by Searle, 
Speed and Milliken (1980) to replace the cryptic 
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SAS-output term “least square means”. Lane and 
Nelder emphasised that these means might either 
be 

conditional on the value we take as stan¬ 
dard for the covariate 

or 

marginal to the observed distribution of 

covariate values, 

and Lane and Nelder went on to explain to this 
new audience (including agricultural statisticians) 
that there exist many other possibilities for choice 
of standard. 

We find it interesting that Lane and Nelder used 
the occasion of the special issue of Biometrics on 
analysis of covariance to point out the similari¬ 
ties to standardisation, and to phrase their “pre¬ 
diction” in much similar terms as the later causal 
analysis would do. Of course, it should be remem¬ 
bered that Lane and Nelder manoeuvered within the 
comfortable framework of randomised field trials. 
Rothman, Greenland and Lash [(2008), page 386 ff.] 
described how these ideas have developed into what 
is now termed regression standardisation. 

An Example: Cancer Trends 

A severe practical limitation of the modelling ap¬ 
proach is that the model must encompass all the 
data to be compared. However, many official statis¬ 
tics are published explicitly to allow comparisons 
with other published series. Even within a single 
publication it may be inappropriate to fit a single 
large and complex model across the entire data set. 

An example of the latter situation is the I.A.R.C. 
monograph on Trends in Cancer Incidence and Mor¬ 
tality (Coleman et ah, 1993). The primary aim of 
this monograph was to estimate cancer trends across 
the population-based cancer registries throughout 
the world and this was addressed by fitting age- 
period-cohort models to the data from each registry. 
But comparisons of rates between registries at spe¬ 
cific time points were also required and, since the age 
structures of different registries differed markedly, 
direct standardisation to the world population, ages 
30-74, was used. However, some of the cancers con¬ 
sidered were rare and this exposes a problem with 
the method of direct standardization—that it can be 
very inefficient when the standard population differs 
markedly from that of the test group. The authors 
therefore chose to apply direct standardisation to 
the fitted rates from the age-period-cohort models. 


Transportability Across Studies 

Pearl and Barenboim (2012) noted that precise 
conditions for applying concrete results obtained in 
a study environment to another “target” environ¬ 
ment, 

remarkably... have not received system¬ 
atic formal treatment. .. The standard lit¬ 
erature on this topic ... consists primar¬ 
ily of “threats”, namely verbal narratives 
of what can go wrong when we try to 
transport results from one study to an¬ 
other. .. Rarely do we find an analysis of 
“licensing assumptions”, namely, formal 
and transparent conditions under which 
the transport of results across differing en¬ 
vironments or populations is licensed from 
first principles. 

After further outlining the strong odds against 
anyone who dares formulate such conditions, Pearl 
and Barenboim then set out to propose one such for¬ 
malism, based on the causal diagrams developed by 
Pearl and colleagues over the last decades; cf. Pearl 
(2009). 

In the terminology of Pearl and Barenboim, the 
method of direct standardisation, together with the 
“predictions” of Lane and Nelder, is a transport for¬ 
mula and, as they state, 

the choice of the proper transport formula 
depends on the causal context in which 
population differences are embedded. 

Although a formal treatment of these issues is 
overdue, it has been recognized in epidemiology for 
many years that the concept of confounding cannot 
be defined solely in terms of a third variable being 
related to both outcome and exposure of interest. A 
landmark paper was that of Simpson (1951) which 
dealt with the problem of interpreting associations 
in three-way contingency tables. As we saw in Sec¬ 
tion 3, although “Simpson’s paradox” is widely re¬ 
garded as synonymous with Yule’s paradox, Simp¬ 
son’s primary concern was the role of the causal con¬ 
text in deciding whether the conditional or marginal 
association between two of the three factors in a ta¬ 
ble is of primary interest. The point has been under¬ 
stood by many (if not all) epidemiologists writing in 
the second half of the 20th century as, for example, 
is demonstrated by the remark of Truett, Cornfield 
and Kannel (1967), cited in Section 6, concerning 
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interpretation of the coefficient of body weight in 
their regression equation for coronary disease inci¬ 
dence. However, as far as we can tell, the issue does 
not seem to have concerned 19th century writers; 
for example, no consideration seems to have been 
given to the possibility that age differences between 
populations could, in part, be a consequence of dif¬ 
ferences in “the force of mortality” and, if so, the 
implication for age standardization. 

8. CONCLUSION 

In the fields of scientific enquiry with which we 
are concerned here, the causal effect of a treatment, 
or exposure, cannot be observed at the individual 
level. Instead, the effect measures we use contrast 
the distributions of responses in populations with 
differing exposures, but in which the distributions 
of other factors do not differ. In randomized studies, 
this equality of distribution of extraneous factors is 
guaranteed by randomisation and causal effects are 
simply measured. In observational studies, however, 
differences between the distributions of relevant ex¬ 
traneous factors between exposure groups (what epi¬ 
demiologists call “confounding”) is ubiquitous and 
we must rely on the assumption of “no unmeasured 
confounders” to allow us to estimate the causal ef¬ 
fect. 

In much recent work, the problem is approached 
by postulating that each individual has a number 
of potential responses, one for each possible expo¬ 
sure; only one of these is observed, the other coun- 
terfactual responses being assumed to be “missing 
at random” given measured confounders. Alterna¬ 
tively, we can restrict ourselves to dealing with ob¬ 
served outcomes, assuming that the mechanism by 
which exposure was allocated in the experiment of 
nature we have observed did not depend on unmea¬ 
sured confounders. The choice between these posi¬ 
tions is philosophical and, to the applied statistician, 
largely a matter of convenience. The more serious 
concern, with which we have been largely concerned 
in this review, is the choice of effect measure; we can 
choose to contrast the marginal distribution of re¬ 
sponses under equality of distribution of extraneous 
factors or to contrast response distributions which 
condition on the values of these factors. 

Standardisation grew up in response to obvious 
problems of age-confounding in actuarial (18th cen¬ 
tury) and demographic (19th century) comparative 
studies of mortality. The simple intuitive calcula¬ 
tions considered scenarios in which either the age 


distributions did not differ (“direct” standardisa¬ 
tion) or age-specific rates did not differ (“indirect” 
standardisation) between study groups. However, 
formal consideration of such indices as effect mea¬ 
sures came later, the contribution of Yule (1934) be¬ 
ing noteworthy. 

There would probably be widespread agreement 
that describing causal effects in relation to all po¬ 
tential causes, as in the conditional approach, must 
represent the most complete analysis of a data set 
and we have described how this approach developed 
throughout the 20th century, starting with the in¬ 
fluential paper of Yule (1900). This impressive pa¬ 
per clearly described the proliferation of “partial” 
association measures introduced by the conditional 
approach, and drew attention to the consequent 
need to use measures which remain relatively sta¬ 
ble across subgroups. In later work Yule (1934) re¬ 
visited classical standardisation in terms of an aver¬ 
age of conditional (i.e., stratum-specific) measures. 
Such early work sowed the seeds of the “statisti¬ 
cal” approach based on formal probability models, 
leading eventually to the widespread use of logis¬ 
tic regression and proportional hazard (multiplica¬ 
tive intensity) models, the contributions of Cochran 
(1954) and Mantel and Haenszel (1959) providing 
important staging posts along the way (even though 
the latter authors explicitly denied any reliance on a 
model). By the end of the century such approaches 
dominated epidemiology and biostatistics. 

Toward the end of the 20th century, the use of 
marginal measures of causal effects emerged from 
the counterfactual approach to causal analysis in 
the social sciences, the idea of propensity scores 
(Rosenbaum and Rubin, 1983) being particularly 
influential. However, these methods only found their 
way into mainstream biostatistics when applications 
arose for which conventional conditional probability 
models were not well suited, the foremost of which 
being the problem of time-dependent confounding. 
Whereas, for simple problems, the parameters of lo¬ 
gistic and multiplicative intensity models have an 
interpretation as measures of (conditional) causal 
effects, Kalbfleisch and Prentice (1980) noted that 
this ceased to be the case in the presence of time- 
dependent confounding. While the statistical mod¬ 
elling approach can be extended by joint modelling 
of the event history and confounder trajectories, 
such models are complex and, again, causal effects 
do not correspond with model parameters and would 
need to be estimated by simulations based on the 
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fitted model. Such models continue to be studied 
[see, e.g., Henderson, Diggle and Dobson (2000)], 
but, although it could be argued that these offer 
the opportunity for a more detailed understanding 
of the nature of causal effects in this setting, the 
simpler marginal approaches pioneered by Robins 
(1986) are more attractive in most applications. The 
success of this latter approach in offering a solution 
to a previously intractable problem encouraged bio¬ 
statisticians to further explore methods for estima¬ 
tion of marginal causal effects; for example, propen¬ 
sity scores are now widely used in this literature. 

So where are we today? Both approaches have 
strengths and weaknesses. The conditional mod¬ 
elling approach relies on the assumption of homo¬ 
geneity of effect across subgroups or, when this fails 
to hold, to a multiplicity of effect measures. The 
marginal approach, while seemingly less reliant on 
such assumptions, encounters the same issue when 
considering the transportability of effect measures 
to different populations. 
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